ABSTRACT

We describe the application of non-negative matrix factorisation to generate compact 
reconstructions of quasar spectra from the Sloan Digital Sky Survey (SDSS), with 
particular reference to broad absorption line quasars (BALQSOs). BAL properties 
are measured for Si IV λ1400, C IV λ1550, Al III λ1860 and Mg II λ2800, resulting
in a catalogue of 3547 BALQSOs. Two corrections, based on extensive testing of 
synthetic BALQSO spectra, are applied in order to estimate the intrinsic fraction 
of C IV BALQSOs. First, the probability of an observed BALQSO spectrum being
identified as such by our algorithm is calculated as a function of redshift, signal-to-noise 
ratio and BAL properties. Second, the different completenesses of the SDSS target 
selection algorithm for BALQSOs and non-BAL quasars are quantified. Accounting 
for these selection effects, the intrinsic C IV BALQSO fraction is 41±5 per cent. Our
analysis of the selection effects allows us to measure the dependence of the intrinsic
C IV BALQSO fraction on luminosity and redshift. We find a factor of 3.5±0.4 decrease
in the intrinsic fraction from the highest redshifts, z~4.0, down to z~2.0. The redshift 
dependence implies that an orientation effect alone is not sufficient to explain the 
presence of BAL troughs in some but not all quasar spectra. Our results are consistent 
with the intrinsic BALQSO fraction having no strong luminosity dependence, although 
with 3σ limits on the rate of change of the intrinsic fraction with luminosity of −6.9
and 7.0 per cent dex⁻¹ we are unable to rule out such a dependence.

Key words: methods: data analysis - galaxies: active - galaxies: nuclei - quasars: 
absorption lines - galaxies: statistics 



1 INTRODUCTION 

Despite the greatly increased quantity and quality of data 
since the first observations of broad absorption line quasars 
(BALQSOs; see Weymann, Carswell & Smith 1981 for a re- 
view), relatively little is known about their physical origins. 
By definition, a BALQSO exhibits blueshifted absorption, 
which may extend to tens of thousands of kilometres per 
second relative to the quasar systemic velocity, with an ab-
sorption velocity width of at least 2000 km s⁻¹ (Weymann
et al. 1981). Such broad absorption features with compa- 
rably large redshifted velocities are not observed, implying 
that the absorption originates in material outflowing from 
the quasar. 

It is not yet clear whether the material outflowing in
BAL systems represents a large mass relative to outflow
processes as a whole, with estimates of the total column 
density of absorbing material varying by several orders of 
magnitude (Hamann 1998; Gabel, Arav & Kim 2006; Case- 
beer et al. 2008; Hamann et al. 2008). However, the absorp- 
tion can certainly be used to trace the kinematics of the 
quasar outflows. As outflows are believed to be fundamental 
to the phenomenon of active galactic nuclei (AGN) feed-
back, a better understanding of their kinematics could shed
light on the physical basis of the outflow phenomenon (see
Begelman 2004 for a review of AGN feedback mechanisms).
In some models the BAL phenomenon arises from preferred 
viewing angles towards the central regions and the demo- 
graphics of BALQSOs among the quasar population may 
also help constrain the parameters of unified models (e.g. 
Elvis 2000). 

Broad absorption is most commonly observed in high-
ionisation lines such as C IV λ1550 and Si IV λ1400; BALQ-
SOs without absorption present in lower-ionisation species
are termed HiBALQSOs. Less frequently, broad absorption
is also observed in the low-ionisation lines Al III λ1860 and
Mg II λ2800. BALQSOs with both high- and low-ionisation
absorption are termed LoBALQSOs, while those even rarer 
objects that also show absorption by one or more species 
of iron are termed FeLoBALQSOs. See Hall et al. (2002) 
for examples of the rich variety of spectra produced by the 
broad absorption phenomenon. 

The most widely used metric to separate BALQSOs and 
non-BAL quasars is the balnicity index (BI), first presented 
by Weymann et al. (1991). The BI is defined as 



\[
\mathrm{BI} = \int_{3000}^{25000} \left[ 1 - \frac{f(-v)}{0.9} \right] C \, \mathrm{d}v, \qquad (1)
\]



where f(v) is the continuum-normalised flux as a function 
of velocity, v, relative to the line centre. The constant C is 
equal to unity in regions where f(v) has been continuously 
less than 0.9 for at least 2000 km s⁻¹, counting from large
to small outflow velocities, and zero elsewhere. The outflow 
velocity, v, is defined to be negative for blueshift velocities, 
i.e. for material moving towards the observer. Quasars with 
non-zero BI are defined as BALQSOs; all others are non- 
BAL quasars. 
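
The definition above can be sketched numerically as follows. This is a minimal illustration, not the code used for the catalogue: it assumes the normalised flux is sampled on a uniform grid of outflow speeds (positive for blueshifted material), it uses the integration limits of 3000 and 25000 km s⁻¹ from Weymann et al. (1991), and the function name and arguments are hypothetical.

```python
import numpy as np

def balnicity_index(v_out, f_norm, v_min=3000.0, v_max=25000.0,
                    min_width=2000.0, depth=0.9):
    """Approximate the BI (km/s) by a Riemann sum on a uniform velocity grid.

    v_out  : outflow speeds in km/s, ascending (positive = blueshifted)
    f_norm : continuum-normalised flux at those speeds
    """
    v_out = np.asarray(v_out, dtype=float)
    f_norm = np.asarray(f_norm, dtype=float)
    dv = v_out[1] - v_out[0]          # grid spacing, assumed uniform
    bi = 0.0
    run = 0.0                         # width of the current run with f < depth
    # Walk from large to small outflow velocity, as in the definition.
    for v, f in zip(v_out[::-1], f_norm[::-1]):
        if v > v_max or v < v_min:
            continue                  # outside the integration limits
        if f < depth:
            run += dv
            if run >= min_width:      # C = 1 once the run spans min_width
                bi += (1.0 - f / depth) * dv
        else:
            run = 0.0                 # trough interrupted, so C = 0 again
    return bi
```

A trough narrower than 2000 km s⁻¹ never switches C on and therefore contributes nothing, which is what separates BAL troughs from narrower absorption in this metric.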

Modifications to the BI have been proposed, in partic-
ular the absorption index (AI; Hall et al. 2002; Trump et al.
2006), designed to include narrower troughs than the BI,
and the modified balnicity index BI₀ (Gibson et al. 2009,
hereafter G09), which extends the integration region to zero
velocity. Each of these modifications increases the number of 
quasars defined to be BALQSOs, but Knigge et al. (2008) 
demonstrated that the observed distribution of AI values
is bimodal, suggesting that the metric encompasses two dis- 
tinct populations of absorbers. Except where noted we define 
BALQSOs in this paper according to the traditional BI. 

The observed C IV BALQSO fraction has been variously
calculated as, among other measurements, 15±3 (Hewett &
Foltz 2003), 14.0±1.0 (Reichard et al. 2003b), 12.5 (Scaringi
et al. 2009), and 13.3±0.6 per cent (G09). Correcting for the
different probabilities of BALQSOs and non-BAL quasars
entering the spectroscopic surveys used, the intrinsic frac-
tion present in flux-limited optical surveys has been esti-
mated as 22±4 (Hewett & Foltz 2003), 15.9±1.4 (Reichard
et al. 2003b), 17±3 (Knigge et al. 2008), and 16.4±0.6 per
cent (G09).

It is often suggested that broad absorption occurs in 
all quasars but is only observed along particular sightlines 
(e.g. Weymann et al. 1991; Elvis 2000). In this model the 
BALQSO fraction can be directly interpreted as the solid 
angle covered by the absorbing clouds divided by the solid 
angle over which a Type 1 quasar can be seen. Another pos- 
sibility is that BALQSOs could represent a particular evo- 
lutionary stage (e.g. Voit, Weymann & Korista 1993; Becker 
et al. 1997; Lipari & Terlevich 2006), during which absorb-
ing material with a high covering fraction is being expelled
from the central regions of the quasar. However, models in 
which individual BALQSOs have a very high covering frac- 
tion appear to be ruled out by the results of Gallagher et al. 
(2007), which show that BAL and non-BAL quasars have 
very similar mid-infrared properties, while a high covering 
fraction would result in more light being reprocessed into
the mid-infrared. The possibility still remains that a com-
bination of evolutionary and orientation effects can explain
the separation of BALQSOs and non-BAL quasars. For ex- 
ample, in a disc wind model the structure of the wind could 
change with cosmic time, while still having an orientation 
dependence. 

It is difficult or impossible to distinguish between the 
different models based on observations of individual objects, 
but much progress can be made by characterising the statis- 
tical properties of the BALQSO population. Crucial prop- 
erties in such an investigation are the dependence of the 
BALQSO fraction on factors such as redshift and luminos- 
ity. However, measuring these properties is greatly compli- 
cated by the selection effects involved, which are themselves 
strongly dependent on redshift and luminosity, in part due 
to their correlations with the signal-to-noise ratio (S/N) of 
observed spectra. For example, the probability of correctly 
identifying a BALQSO increases with increasing S/N, which 
is in turn related to the luminosity of an observed quasar. 
Without quantifying the form of the selection effect it is im- 
possible to establish whether or not an observed trend in the 
BALQSO fraction with luminosity is the result of an intrin- 
sic trend. The existence of such an effect has previously been 
noted (Knigge et al. 2008; G09) but not fully addressed; we 
quantify the S/N-dependent selection probability and other 
selection effects in this paper. 

A pre-requisite for any investigation of the statisti- 
cal distribution of absorption properties is a large well- 
defined catalogue of BALQSOs. The Sloan Digital Sky Sur- 
vey (SDSS; York et al. 2000) is very well suited to this pur- 
pose as it provides a large homogeneous sample of quasar 
spectra in which to search for BALQSOs. Previous studies 
producing BALQSO catalogues from the SDSS include Re- 
ichard et al. (2003a), using the Early Data Release (EDR), 
Trump et al. (2006), using the Third Data Release (DR3), 
and G09 and Scaringi et al. (2009), both using DR5. In this 
work, quasars from the SDSS DR6 are used, giving a larger 
sample size. 

The BI, AI and BI₀ are all calculated from the
continuum-normalised flux, so all require an estimate of the 
unabsorbed continuum level. Improving the quality of the 
continua will greatly improve the accuracy of the resulting 
determinations of statistical properties, as the continuum 
level is currently the principal uncertainty in the classifi- 
cation of BALQSOs. Previous studies have estimated the 
continuum using template spectra (Trump et al. 2006) or 
simple continuum plus emission-line models (G09), but each
of these techniques has its disadvantages: template spectra
are limited in the range of continua and emission-line profiles
they allow, while any technique that relies on directly fitting 
the emission lines will struggle when significant sections of 
those emission lines are absorbed. 

In this paper we produce estimates of the unabsorbed 
emission using non-negative matrix factorisation (NMF; Lee 
& Seung 1999, 2000; Blanton & Roweis 2007), a blind source 
separation technique. NMF uses all the available informa- 
tion to fit the entire spectrum simultaneously, enabling ac- 
curate reconstructions to be produced even in cases where 
much of the spectrum has been absorbed. Starting from a 
suitably chosen input sample the technique is able to recon- 
struct spectra with a wide range of observed properties with 
little or no manual intervention. The estimates of the con-
tinua in the C IV BI region will be made publicly available
through the SDSS value-added catalogues as this paper is 
published, to enable their use in any future BALQSO clas- 
sification schemes. 

In Section 2 we introduce blind source separation and 
non-negative matrix factorisation. In Section 3 we present 
our sample of SDSS quasars, and in Section 4 we describe 
our method of reconstructing their spectra. The results are 
presented in Section 5, including the resulting catalogue of 
BALQSOs. In Section 6 we compare these results to those 
from previous studies. In Sections 7 and 8 we quantify the 
probability of a BAL trough being detected by our meth- 
ods, and the relative probabilities of BALQSOs and non- 
BAL quasars entering the SDSS spectroscopic survey. From 
these results the intrinsic BALQSO fraction is calculated in 
Section 9. The results are discussed in Section 10, and we 
summarise our conclusions in Section 11. 

Those readers who are particularly interested in the 
BALQSO catalogue itself, its generation and the observed 
properties of BALQSOs should focus on Sections 3-5. Read- 
ers who are more interested in the relation between the ob- 
served and intrinsic properties of the BALQSO population, 
and in particular the calculation of the intrinsic fraction of 
BALQSOs as a function of redshift and luminosity, can skip 
to Section 7. 

In this work we assume a flat ΛCDM cosmology with
H₀ = 70 km s⁻¹ Mpc⁻¹, Ω_m = 0.3 and Ω_Λ = 0.7. Vacuum wave-
lengths are employed throughout the paper.



2 BLIND SOURCE SEPARATION 

Blind source separation (BSS) techniques involve rewriting
an n × m data matrix V as the product of a set of compo-
nents, H, and weights, W:

\[
V = WH. \qquad (2)
\]



In the context of this work, V is an n x m array of flux 
measurements for n different quasars at m wavelengths, H 
is an r x m array of the r component spectra at the same 
wavelengths, and W is an n x r array of the corresponding 
weights for each quasar. For any individual quasar, the ob- 
served spectrum is written as a linear combination of the r 
components. In the case that r < n, the equality in equa- 
tion 2 is an approximation, and the product WH can be 
viewed as a reconstruction of the original data. It is this re- 
construction that we use as an estimate of the unabsorbed 
continuum of BALQSOs. 



2.1 Non-negative matrix factorisation

Non-negative matrix factorisation (NMF) is a relatively new 
BSS technique that incorporates a non-negativity constraint 
on both its components and their weights (Lee & Seung 
1999, 2000; Blanton & Roweis 2007). The non-negativity
constraint is appealing in the context of spectroscopic data 
as the physical emission signatures are expected to obey 
such a restriction naturally. Unusually for a BSS technique,
fewer components are generated than there are input spec-
tra. Starting from random initial matrices, the components,
H, and weights, W, follow the multiplicative update rules

\[
W \leftarrow W \circ \frac{V H^{T}}{W H H^{T}}, \qquad (3)
\]

\[
H \leftarrow H \circ \frac{W^{T} V}{W^{T} W H}, \qquad (4)
\]

in which the product \circ and the division are taken element by
element and, at each step, the elements of W and H are replaced
by the values given by the right hand sides of equations 3
and 4. The matrices continue to be updated until a stable
solution is reached.
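
As an illustration of the update scheme, the sketch below applies the standard Lee & Seung multiplicative updates for the Euclidean distance to a data matrix V. It is a minimal example under stated assumptions rather than the implementation used in this work: the function name, the fixed iteration count (in place of a convergence test) and the small constant `eps` that guards against division by zero are all choices made for the example.

```python
import numpy as np

def nmf(V, r, n_iter=500, eps=1e-12, seed=0):
    """Factorise a non-negative (n x m) matrix V into W (n x r) and H (r x m)."""
    V = np.asarray(V, dtype=float)
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + eps      # random non-negative starting weights
    H = rng.random((r, m)) + eps      # random non-negative starting components
    for _ in range(n_iter):
        # Element-wise multiplicative updates; non-negativity is preserved
        # because every factor on the right-hand side is non-negative.
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return W, H
```

Calling `nmf(V, r)` with V holding one spectrum per row returns the weights and the r component spectra; the product `W @ H` is then the reconstruction of each input spectrum.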

The update rules minimize the error in the reconstruc- 
tions, WH, as measured by the Euclidean distance to the 
data, defined as 



\[
\| V - WH \| = \sqrt{ \sum_{ij} \Bigl( V_{ij} - \sum_{k} W_{ik} H_{kj} \Bigr)^{2} }. \qquad (5)
\]



The random starting conditions result in slightly different 
components being generated each time the algorithm is ex- 
ecuted. However, as the Euclidean distance is defined in 
terms of the reconstructions, and the algorithm converges 
on a minimum in the Euclidean distance (Lee & Seung 1999, 
2000), the resulting reconstructions do not vary. 

After an initial source separation has been performed, 
generating a set of components, the components can be ap- 
plied to create compact reconstructions of further spectra. 
A new data matrix is created from the spectra that are to 
be reconstructed, the components matrix is taken from the 
initial results, and a random set of starting weights is gen- 
erated. The matrix of weights is then updated according 
to equation 3, keeping the components matrix fixed. This 
process finds a local minimum in the Euclidean distance be- 
tween the data and the reconstructions, under the constraint 
of fixed H.
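
The fixed-component step can be sketched as below. Only the weights are updated, and an optional 0/1 pixel mask is included because later sections exclude absorbed pixels from the fit. How the mask enters the update (as a per-pixel weight, in the manner of weighted NMF) is an assumption of this example, as are the function name and the fixed iteration count.

```python
import numpy as np

def fit_weights(V_new, H, mask=None, n_iter=500, eps=1e-12, seed=0):
    """Find non-negative weights W so that W @ H approximates V_new, with H fixed.

    mask : optional array like V_new, 1 for pixels to fit and 0 for pixels to ignore
    """
    V_new = np.asarray(V_new, dtype=float)
    M = np.ones_like(V_new) if mask is None else np.asarray(mask, dtype=float)
    rng = np.random.default_rng(seed)
    W = rng.random((V_new.shape[0], H.shape[0])) + eps   # random starting weights
    numer = (M * V_new) @ H.T                            # constant while H is fixed
    for _ in range(n_iter):
        # Weight update only; masked pixels drop out of both sums.
        W *= numer / ((M * (W @ H)) @ H.T + eps)
    return W
```

The reconstruction `W @ H` is defined at every wavelength, including the masked ones, which is what allows it to serve as an estimate of the unabsorbed emission inside a trough.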



3 QUASAR SAMPLE 

The quasar sample consists of 91 665 objects from the SDSS 
DR6 spectroscopic survey, including 77 392 quasars in the 
Schneider et al. (2007) DR5 quasar catalogue that are re- 
tained in the later DR7 quasar catalogue of Schneider et al. 
(2010). A further 13 081 objects are quasars, present in the 
additional DR6 spectroscopic plates, identified by one of 
us (PCH) using a similar prescription to that employed 
by Schneider et al. (2007), all of which are present in the 
Schneider et al. (2010) catalogue. An additional 1192 ob- 
jects, which do not satisfy one, or both, of the emission line 
velocity width or absolute magnitude criteria imposed by
Schneider et al. (2007), are also included. While formally
failing the 'quasar' definition of Schneider et al.'s DR5 and
DR7 compilations, the objects are essentially all luminous
active galactic nuclei (AGN). In fact, the vast majority of
such objects lie at redshifts z<0.4, where even Mg II broad
absorption is not visible in the SDSS spectra. In the fol- 
lowing work, objects with z<0.4 were not searched for BAL 
features, and a further 30 spectra were discarded because 
they were very poor quality or unavailable in DR6, leaving 
93 400 spectra of 86 773 unique objects. 

The spectra were all processed through the sky-residual 
subtraction scheme of Wild & Hewett (2005), resulting
in significantly improved S/N at observed wavelengths
λ>7200 Å.

The SDSS DR6 contains a very large number of ob- 
jects for which multiple spectra are available. For our quasar 
sample there are ~9000 independent pairs of spectra. The 
catalogue of spectrum pairs allows an empirical check on 
the reproducibility of the BI determinations as a function of 
spectrum S/N. 

Improved quasar redshifts were taken from a prelimi- 
nary implementation of the Hewett & Wild (2010) scheme 
for generating self-consistent redshifts for quasars with SDSS 
spectra. While not responsible for major differences in the 
resulting catalogue of BALQSOs, the large number of red-
shift revisions at the ~1000 km s⁻¹ level does result in clas-
sification changes for a number of quasars with absorp-
tion present at low velocities, <5000 km s⁻¹, relative to the
quasar systemic redshift. 

The SDSS spectra as provided are not corrected for the 
effects of Galactic extinction, so before use we applied a 
correction using the SDSS E(B−V) measurement, taken
from the dust maps of Schlegel, Finkbeiner & Davis (1998), 
and the Milky Way extinction curve of Cardelli, Clayton & 
Mathis (1989). 



4 CONTINUUM RECONSTRUCTIONS OF 
QUASARS 

In order to measure the BI, or any related metric, an esti- 
mate of the unabsorbed emission from the quasar is required. 
Here, non-negative matrix factorisation (NMF) is used to 
create sets of component spectra from non-BAL quasars. 
These components are then used to reconstruct the emission 
for all quasars, thereby allowing the implementation of a BI- 
determination algorithm to define the sample of BALQSOs. 
The implementation of the method is described in detail in 
the following sections. 

For any set of spectra, NMF is best applied to the com- 
mon rest-frame wavelength range. To maximise the possi- 
ble wavelength range, and to allow for redshift evolution 
of quasar properties, the quasars in the sample defined in 
Section 3 were divided into non-overlapping redshift bins of 
width Δz=0.1. Except where noted, the methods described
in this section were applied separately to each such redshift 
bin. 
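
The common rest-frame range of a redshift bin follows from the observed wavelength coverage. The sketch below assumes nominal SDSS observed-frame limits of 3800 and 9200 Å, which are approximate values adopted for illustration, and optionally applies the 1175 Å blue truncation used at high redshift; the function name is hypothetical.

```python
def common_rest_range(z_lo, z_hi, obs_min=3800.0, obs_max=9200.0, blue_cut=1175.0):
    """Rest-frame wavelength range (angstroms) shared by all quasars with z_lo <= z < z_hi."""
    # The lowest-redshift quasar sets the blue limit of the overlap,
    # and the highest-redshift quasar sets the red limit.
    lam_min = obs_min / (1.0 + z_lo)
    lam_max = obs_max / (1.0 + z_hi)
    # Do not extend blueward of the truncation wavelength.
    lam_min = max(lam_min, blue_cut)
    return lam_min, lam_max
```

Narrow bins keep this overlap wide: the rest-frame coverage lost at each end of a spectrum grows with the width of the bin.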



4.1 Quasar spectrum selection for component 
generation 

To generate component spectra a sample of 500 input
quasars was selected in each redshift bin. The input spec-
tra were chosen to possess spectrum S/Ns, specifically,
(SN_R+SN_I)/2, within the restricted range 9-25 and to pos-
sess at least 3800 'good' SDSS pixels. Quasars exhibiting
the presence of any form of extended absorption in their 
spectra longward of Lyα λ1216 were then excluded. A sim-
ple, very conservative, algorithm, based on the 'bending'
of absorption-free template spectra to 'fit' each spectrum, 
allowed the calculation of a generic 'absorption equivalent 
width'. The S/N and absorption cuts removed virtually all 
heavily-reddened quasars, but some additional objects in
the redshift bin 0.7<z<0.8, with strong [O III] λ5007 emis-
sion and heavily dust-reddened continua, were manually re-
moved. The form of the NMF components generated from
the input quasar sample did not depend on the exact defi- 
nition of the sample of input spectra. 

The spectra at high redshifts were truncated at 1175 Å,
below which wavelength the spectra are dominated by the
Lyα forest, severely attenuating the underlying quasar emis-
sion. At redshifts z>2.1 fewer than 500 quasars in each red-
shift bin satisfied the selection requirements, so smaller in-
put samples were used, reaching a minimum of 202 quasars
for 2.5≤z<2.6. For z≥2.6 there were insufficient quasars to
produce representative components. However, due to the im-
position of the truncation at 1175 Å, the components gener-
ated from the 2.5≤z<2.6 interval covered the full accessible
wavelength range present in quasars with z>2.6. Thus, the
same set of NMF components was used for all quasars with
z>2.5.

In each redshift bin, NMF components were generated 
from the input quasar spectra after pixels affected by any 
narrow absorption systems present had been flagged using 
a median filter-based algorithm. Flux values for such pixels, 
along with those pixels flagged as 'bad' due to a value of zero 
in the SDSS spectrum noise array, were generated via inter-
polation. Similarly, 12.0-Å regions centred on the prominent
sky lines at observed-frame wavelengths 5578.5 and 6301.7 Å
were interpolated over. Note that the use of spectra with a 
very high fraction of 'good' pixels ensured that interpola- 
tion was applied for, at most, only a few, narrow, intervals 
in each spectrum. The spectra were then normalised to have 
the same median flux. 
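
The pixel preparation described above might look like the following sketch. Linear interpolation and the function names are assumptions of the example; the text does not specify the interpolation scheme, and the median filter-based detection of narrow absorption is not reproduced here.

```python
import numpy as np

SKY_LINES = (5578.5, 6301.7)   # observed-frame wavelengths in angstroms (from the text)

def sky_line_mask(wave_obs, width=12.0):
    """Flag pixels within width/2 of each prominent sky line."""
    wave_obs = np.asarray(wave_obs, dtype=float)
    bad = np.zeros(wave_obs.shape, dtype=bool)
    for line in SKY_LINES:
        bad |= np.abs(wave_obs - line) <= width / 2.0
    return bad

def interpolate_and_normalise(wave, flux, bad, target_median=1.0):
    """Replace flagged pixels by linear interpolation, then scale to a common median.

    Assumes wave is increasing and the median flux is positive.
    """
    wave = np.asarray(wave, dtype=float)
    flux = np.asarray(flux, dtype=float)
    bad = np.asarray(bad, dtype=bool)
    good = ~bad
    filled = flux.copy()
    filled[bad] = np.interp(wave[bad], wave[good], flux[good])
    return filled * (target_median / np.median(filled))
```

Scaling every input spectrum to the same median prevents the brightest quasars from dominating the Euclidean distance that the factorisation minimises.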



4.2 Number of components 

In applying NMF the number of components to generate 
must be chosen manually. In all cases using more compo- 
nents will reduce the total Euclidean distance between the 
reconstructions and the data, but above a certain number 
the improvement comes from fitting to the noise of individ- 
ual spectra rather than to the general emission properties 
of the quasars, giving no further improvement in the quality 
of subsequent reconstructions. An estimate of the optimal 
number of components to employ can be found via the χ²_ν
values of the reconstructions of the input spectra: when too
many components are used and the NMF procedure is repro-
ducing the noise from just one or a few spectra, the 'overfit-
ted' spectra have much lower χ²_ν values than any others in
the input sample.

The point at which overfitting occurs is a function of 
both the S/N and the emission properties of the input spec- 
tra. The number of components to use was chosen as the 
greatest number for which no input spectra showed the drop 
in χ²_ν typical of overfitting. In practice the correct number
of components could be selected from a visual inspection of
the χ²_ν distributions. The number of components so identi-
fied varied between 8 and 14 depending on the redshift bin.
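
The overfitting diagnostic can be illustrated with a per-spectrum reduced chi-squared. In this sketch the number of degrees of freedom is taken to be the number of pixels minus the number of components, and spectra are flagged when their value falls below a fixed fraction of the sample median. Both choices are assumptions made for the example; in this work the number of components was chosen by visual inspection of the distributions.

```python
import numpy as np

def reduced_chi2(V, recon, sigma, n_components):
    """Reduced chi-squared of each reconstructed spectrum (one value per row)."""
    V = np.asarray(V, dtype=float)
    recon = np.asarray(recon, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    dof = V.shape[1] - n_components          # assumed degrees of freedom
    return np.sum(((V - recon) / sigma) ** 2, axis=1) / dof

def flag_overfitted(chi2_nu, fraction=0.5):
    """Flag spectra whose reduced chi-squared is far below the sample median.

    The threshold `fraction` is a hypothetical tuning parameter.
    """
    chi2_nu = np.asarray(chi2_nu, dtype=float)
    return chi2_nu < fraction * np.median(chi2_nu)
```

Repeating the check while increasing the number of components, and stopping just before any spectrum is flagged, mirrors the selection rule described above.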

In order to reconstruct a small fraction of the quasar 
spectra the number of components was reduced below the 
number initially determined. These cases are described in 
Sections 4.6 and 4.7. 

Figure 1. Illustration of the automatic masking procedure. The
upper panel shows the initial masked regions (grey shading) and
the resulting NMF reconstruction (red line) of the observed flux
(black line). The lower panel shows the updated mask derived
from a comparison of the initial reconstruction and the data.
The new reconstruction, derived using the updated mask, is also
shown. The object illustrated is SDSS J080559.94+140530.4.



4.3 Automated masking of spectra 

The sets of components, derived as described above, were 
then applied to generate reconstructions of all quasars in the 
sample, following the procedure described in Section 2.1. As 
reconstructions of the unabsorbed emission were required, 
all wavelength regions where absorption was present were 
masked during the fits, along with regions affected by promi- 
nent sky lines. The same recipe for identifying narrow ab- 
sorption and sky lines described in Section 4.1 was applied, 
but the exclusion of wavelength regions affected by broad ab- 
sorption required an iteratively updated mask, as described 
below. 

As an initial mask the wavelength regions λ≤1240,
1295≤λ≤1400, 1430≤λ≤1546, and 1780≤λ≤1880 (all in Å)
were removed, and a reconstruction was generated based on 
the pixels remaining. For subsequent iterations the mask was 
re-defined by examining the data and reconstructions within 
a window of width 31 pixels centred on each pixel in turn: 
a pixel was masked if the majority of pixels in its window 
had an observed flux lower than the reconstruction by at 
least twice the noise level. Any wavelength regions masked 
in this way were extended by a radius of 10 pixels to ensure 
the wings of the absorption were fully covered. Information 
about the locations of previous masks was not used in gener- 
ating new masks and no restrictions were placed on potential 
mask locations. In most cases the mask locations converged 
in only a few iterations. The masking process is illustrated 
in Fig. 1, which shows the initial mask and the mask after 
the first iteration for an example BALQSO. 
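
One iteration of the mask update can be sketched as follows, using the window width, noise threshold and growth radius quoted above. The function name is hypothetical, and the treatment of the spectrum edges (pixels beyond the ends are counted as unabsorbed) is an assumption of the example. The spectrum is assumed to be longer than the window.

```python
import numpy as np

def update_mask(flux, recon, noise, window=31, n_sigma=2.0, grow=10):
    """Return a boolean mask (True = excluded from the fit) for one spectrum."""
    flux = np.asarray(flux, dtype=float)
    recon = np.asarray(recon, dtype=float)
    noise = np.asarray(noise, dtype=float)
    # Pixels lying below the reconstruction by at least n_sigma times the noise.
    low = (recon - flux) >= n_sigma * noise
    # Number of such pixels in a window centred on each pixel.
    counts = np.convolve(low.astype(float), np.ones(window), mode="same")
    masked = counts > window / 2.0           # majority of the window is low
    # Extend every masked region by `grow` pixels on each side.
    spread = np.convolve(masked.astype(float), np.ones(2 * grow + 1), mode="same")
    return spread > 0
```

Because each new mask is computed only from the current reconstruction and the data, repeated calls alternate with refitting the weights until the mask stops changing.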



4.4 Accounting for quasar SED changes due to 
dust absorption 

NMF, like most BSS techniques, assumes the observed data 
is a linear sum of various components. In contrast, the ex- 
tinction and reddening of light by intervening dust, either 
within the quasar's host galaxy or in an intervening ab-
sorber, has the effect of multiplying the observed spectrum
by a wavelength-dependent factor.

For moderate levels of dust this effect is accounted for 
within the NMF components, as these were generated from 
quasar spectra that were themselves subject to varying levels 
of dust absorption. However, for more strongly reddened ob- 
jects, exhibiting dramatically steeper spectral energy distri- 
butions (SEDs), an empirical correction was applied to esti- 
mate the unreddened spectrum. The method is illustrated in 
Fig. 2. A composite quasar spectrum was generated in each 
redshift bin from all quasars that satisfied the input criteria 
listed in Section 4.1. The ratio of each observed spectrum 
to its corresponding composite spectrum was taken, and a 
power law was fitted to the ratio to determine the level of 
reddening. Prominent emission lines were masked out during 
the fit. If the index of the best-fitting power law was greater 
than 1.5, indicating the observed spectrum was significantly 
redder than the composite, the observed spectrum was di- 
vided by the best-fitting power law to give the observed 
spectrum a similar shape to the composite. An NMF fit 
was then generated for the empirically 'de-reddened' quasar 
spectrum, and the fit was then multiplied by the best-fitting 
power law to match the original observed spectrum¹.
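
The empirical correction can be sketched as below. The sketch assumes the ratio of observed to composite flux is modelled as a power law in wavelength, fitted by least squares in log-log space, with a positive index meaning the spectrum is redder than the composite. The function name, the fitting method and the boolean `fit_mask` (True for pixels away from prominent emission lines) are assumptions of the example.

```python
import numpy as np

def deredden_for_fit(wave, flux, composite, fit_mask, threshold=1.5):
    """Flatten a strongly reddened spectrum before the NMF fit.

    Returns (flux_for_fit, power_law, index). Multiply the NMF reconstruction
    of flux_for_fit by power_law to compare it with the observed spectrum.
    """
    wave = np.asarray(wave, dtype=float)
    flux = np.asarray(flux, dtype=float)
    composite = np.asarray(composite, dtype=float)
    ratio = flux / composite
    ok = np.asarray(fit_mask, dtype=bool) & (ratio > 0)
    # Straight-line fit in log-log space: log10(ratio) = index * log10(wave) + c.
    index, c = np.polyfit(np.log10(wave[ok]), np.log10(ratio[ok]), 1)
    if index > threshold:
        power_law = 10.0 ** c * wave ** index
        return flux / power_law, power_law, index
    # Not red enough: leave the spectrum unchanged.
    return flux, np.ones_like(flux), index
```

Spectra below the threshold are returned untouched, so the moderate reddening already represented in the components is left for the NMF fit itself to absorb.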

It should be emphasized that the procedure is in no 
way designed to parametrize, or otherwise quantify, the ef- 
fect of dust on the quasars. Rather, the procedure has been 
crafted simply to allow effective NMF reconstructions of 
quasars that possess very different SED shapes compared 
to the bulk of the population. The success of the power-law 
normalisation can be attributed to the typical reddening to- 
wards SDSS quasars being very close to a power law in form 
(Hopkins et al. 2004; Maddox et al. 2010, in preparation). 



4.5 Maximising C IV BAL trough coverage 

In taking the common wavelength range of a set of spectra 
spread over a non-zero redshift range, some information from 
the ends of each observed spectrum is necessarily discarded. 
For quasars with z~1.6 the C IV BAL trough at the blue end 
of the observed SDSS spectrum can be lost as a result. To 
extend the ability to identify BALQSOs to the lowest pos- 
sible redshifts the NMF reconstructions of all quasars with 
1.5≤z<1.7 were derived using the components from the next
highest redshift bins, i.e. quasars with 1.5≤z<1.6 used the
components from 1.6≤z<1.7, and quasars with 1.6≤z<1.7
used the components from 1.7≤z<1.8.

Using higher-redshift components ensured maximal 
coverage of the C IV regions at the cost of discarding a 
greater portion of the red end of the observed spectra. A 
similar scheme using lower-redshift components to cover a 
BAL trough at the red end of a spectrum was not required 
as the highest-redshift set of components used (2.5≤z<2.6)
had complete coverage of the C IV BAL region. 

An example BALQSO spectrum with z=1.64 is shown
in Fig. 3, which shows how the BAL trough is only identified
when the higher-redshift components are used. The ability
to detect BAL troughs at the extreme edges of spectra is a
particular advantage of the NMF-based technique.

¹ For redshifts z<0.6 the de-reddening procedure was not used as
the observed quasar spectra show a significantly greater spread in
power law slopes, prohibiting a clean separation of red outliers.

Figure 2. Top panel: observed flux (black) and initial NMF
reconstruction (green) of SDSS J145907.19+002401.2, a quasar at
z=3.04. The quality of the fit is poor because the quasar is
heavily reddened. Middle panel: flux (black) of the same quasar
after being divided by a power law slope with index 2.13, and the
resulting NMF reconstruction (red). Bottom panel: as middle panel,
but after the flux and reconstruction have been multiplied by the
same power law slope to restore the true shape of the observed
quasar spectrum.

Figure 3. SDSS J143735.50+222340.3, a BALQSO at z = 1.64.
In each panel the thin black line is the observed flux and the
thick red line is the NMF reconstruction. The NMF components
for 1.6≤z<1.7 do not extend below a wavelength of 1464 Å, so the
reconstruction does not cover the BAL trough (left panel). Using
the NMF components derived from 1.7≤z<1.8 quasars (right
panel) allows the BAL trough to be identified.

Figure 4. Top panel: Flux spectrum (black) of SDSS
J104945.36+285823.3 with initial 14-component NMF reconstruction
(red). The initial reconstruction was identified as having an
unphysical emission line profile. Bottom panel: the same quasar
spectrum (black) and the NMF reconstructions after additional
masking, for 14, 13, 12 and 11 NMF components. At 11 components
a satisfactory emission line profile is found and this
reconstruction was adopted.



4.6 Rejecting unphysical emission line profiles 

Although the automated technique described above pro- 
duced excellent reconstructions of both the continuum and 
emission lines for most spectra, in up to 5 per cent of spec- 
tra an unphysical 'dip' could be identified in the profile of 
the C IV emission line (Fig. 4). Spectra where such arti-
facts were present in the NMF reconstructions were iden-
tified via a simple 'minima detection' algorithm. Much as
for the scheme used to generate an absorption equivalent 
width (Section 4.1), a template spectrum was 'bent' to fit 
the overall shape of the NMF reconstruction and a simple 
scheme was used to identify any location where the NMF 
reconstruction fell below the template spectrum. 

In many cases the dip in the reconstructed emission 
line was caused by incomplete masking of absorption in the 
observed spectrum, so spectra with an identified dip were 
reprocessed with an additional mask applied to the affected 
region. The resulting fits were examined by eye and accepted 
if the dip was no longer visible. Where a dip was still appar- 
ent in the reconstruction the number of components used to 
create the reconstructions was reduced one by one until the 
dip was eliminated. For this purpose smaller sets of NMF 
components were generated using the same input spectra as 
used to generate the initial base NMF sets. 

Reducing the number of components reduces the free- 
dom of the fitting procedure to create unphysical line pro- 
files, eliminating the dips in almost all cases. Quasars whose 
reconstructions still exhibited a dip in the C IV emission line 
with as few as two components are discussed further in the 
following section. 


4.7 Manual masking of spectra

Spectra with reconstructions that met one or more of the fol- 
lowing criteria were examined by eye to determine if the au- 
tomated masking had failed to correctly separate absorbed 
and unabsorbed regions: 

(i) dip in C IV emission line profile still present with only 
two NMF components 

(ii) χ²_ν > 2

(iii) at least 500 pixels masked 

Together these criteria selected ~2.5 per cent of the recon- 
structions, of which ~20 per cent (~0.5 per cent of the total) 
were judged to require manually-defined masks.²

New reconstructions were generated for the selected 
spectra with manually-defined masks used in place of the 
automated masks. The results were visually inspected to 
identify unphysical emission line profiles and, where these 
were found, the number of components was reduced as in 
Section 4.6. At this point reconstructions with only two com- 
ponents were allowed. In a very small number of cases the 
reconstructed emission line profiles still exhibited unphysi- 
cal profiles with as few as two components; in these cases 
the best reconstruction available was chosen even if a dip 
was present. 

Notwithstanding the presence of the occasional patho- 
logical case where the automated NMF continuum genera- 
tion fails, the success of the NMF-based scheme is evident 
from the extremely low percentage of quasars (0.5 per cent) 
that require any manual intervention and from the fact that only 54
spectra possess reconstructions that we judge to be significantly
sub-optimal. 



4.8 Visual inspection of spectra 

As a final check, all spectra with non-zero BI were visually 
inspected in order to discard any in which the observed BAL 
trough was the result of a poor reconstruction, or consisted 
solely of bad pixels. Troughs that were visible in the data but 
were extended by a number of bad pixels had their BI val- 
ues recalculated with those pixels removed. In cases where 
a small number of bad pixels appeared within a trough, no 
changes were made. Again, the number of spectra where the 
automated procedure requires tweaking is extremely small: 
352 spectra had BAL troughs modified or removed from 
the catalogue because of regions of bad pixels, and 46 had 
troughs removed because of poor continuum fits. Of the 46 
spectra, the problem in all but 10 affected only the Mg II
trough. Cases in which a BAL trough exists but is not iden- 
tified are discussed in Sections 6 and 7. 



2 Additionally, the automated procedure had failed to produce 
fits for three spectra: for one the automated mask covered the 
entire spectrum; for the second, the power law slope correction, as 
described in Section 4.4, was so great it suppressed all information 
in the spectrum; for the third, large sections of the SDSS spectrum 
had recorded fluxes of ±∞. Manually-defined masks were also
created for these three spectra. 



5 THE CATALOGUE OF BALQSOS 

Data derived from the NMF fits to the SDSS DR6 quasar 
spectra are presented in Table 1. For each spectrum we 
present the SDSS coordinate object name, right ascension 
(RA) and declination (both using J2000 coordinates), the 
modified Julian date (MJD), plate and fiber numbers of the 
spectroscopic observation, the S/N, flux and luminosity at 
1700 Å, redshift, i-magnitude, whether the object was targeted
in the SDSS as a HIZ or LOWZ quasar (see Section 8),
whether the spectrum is the primary spectrum for an object,
and whether the object was used to generate NMF components
(GenComp). Regarding the individual NMF fits, we
present the number of components used (N_comp), whether
that number is a reduction from the number available (RedComp),
the χ²_ν of the fit, the number of pixels retained after
masking (N_pix), the index of the power-law slope fitted
to identify red objects (Slope), whether a slope correction
was applied (SlCor), and whether an additional mask was
applied to cover a dip in the reconstruction (DipMask) or
defined manually (ManMask).

For each of Si IV λ1400, C IV λ1550, Al III λ1860 and
Mg II λ2800 we list the BI, mean depth (d) and minimum
and maximum velocities (v_min, v_max) of any identified BAL
troughs, and the minimum and maximum velocities covered
by both the SDSS spectra and NMF components (v_cov,min,
v_cov,max). The coverage velocities are set to zero when the
BAL region has no coverage at all. We also list the number of
bad pixels within the BI integration range (N_br) and within
any BAL troughs (N_bt), and the same for pixels affected
by sky lines (N_sr and N_st). Finally, four flags are listed for
each species: Inc is set if the coverage of the BI region is 
incomplete, CBP denotes a manual change has been made 
to the measured BI and associated properties because of 
bad pixels in the originally-identified trough, CPF denotes 
a manual change has been made because of a poor NMF fit, 
and BBP is set if there is a broad region of bad pixels that 
could be concealing a BAL trough. 

The above information is presented for all quasars in 
the sample described in Section 3, regardless of whether any 
BAL troughs are identified. 

The SDSS names are taken from the SDSS DR7 Legacy 
Release where available. To calculate the S/N, flux and luminosity
at 1700 Å the NMF reconstruction of the flux was
smoothed by a median filter with width 41 pixels, and the
region between 1650 and 1750 Å was extracted. The flux
quoted (F_1700) is the median value within this range, while
the S/N (SN_1700) is the median of the smoothed flux spectrum
divided by the SDSS per-pixel noise spectrum. Bad
pixels and those affected by sky lines were excluded from 
the S/N calculation as their noise values are unreliable or 
nonexistent, and the same pixels were excluded from the 
flux measurement for consistency, particularly with regard 
to the correlations measured in Section 7.2. 
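
The measurement just described can be sketched in a few lines. This is a minimal illustration rather than the code used for the catalogue: the function names, the truncation of the median window at the array edges, and the reading of the S/N as the median of the per-pixel ratio of smoothed flux to noise are all assumptions, and the toy spectrum is invented.

```python
from statistics import median

def median_filter(values, width=41):
    """Running median; the window is truncated at the array edges
    (an assumption, the edge treatment is not stated in the text)."""
    half = width // 2
    n = len(values)
    return [median(values[max(0, i - half):min(n, i + half + 1)])
            for i in range(n)]

def flux_and_sn_1700(rest_wave, recon_flux, noise, good, lo=1650.0, hi=1750.0):
    """Return (F_1700, SN_1700); both are zero if the 1650-1750 A window
    has no usable pixels, mirroring the convention used in Table 1."""
    smooth = median_filter(recon_flux)
    # Bad and sky-affected pixels are excluded from both measurements.
    sel = [i for i, w in enumerate(rest_wave)
           if lo <= w <= hi and good[i] and noise[i] > 0]
    if not sel:
        return 0.0, 0.0
    f_1700 = median(smooth[i] for i in sel)
    sn_1700 = median(smooth[i] / noise[i] for i in sel)
    return f_1700, sn_1700

# Toy spectrum (hypothetical): flat reconstruction with uniform noise.
rest_wave = [1600.0 + i for i in range(201)]
recon_flux = [10.0] * 201
noise = [2.0] * 201
good = [True] * 201
good[100] = False                          # one flagged pixel at 1700 A
f_1700, sn_1700 = flux_and_sn_1700(rest_wave, recon_flux, noise, good)
print(f_1700, sn_1700)                     # 10.0 5.0
```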

The luminosity (λL_1700) is calculated from F_1700 using
a luminosity distance based on the redshift given in Table 1,
and is quoted after multiplying by λ = 1700 Å for units of
erg s^-1. The S/N, flux and luminosity are listed as zero if the
range 1650-1750 Å is outside the rest-frame coverage of the
SDSS spectrum, or is covered entirely by bad pixels. The
i-magnitudes are SDSS i-band PSF magnitudes, corrected for
Galactic extinction according to the dust maps of Schlegel
et al. (1998).

The BI values are calculated according to equation 1 
using the rest-frame zero-velocity wavelengths 1402.77,
1550.77, 1862.79 and 2803.53 Å for Si IV, C IV, Al III and
Mg II, respectively. The corresponding wavelengths for
v = −25 000 km s^-1, the maximum velocity at which we
search for BAL troughs, are 1290.61, 1426.78, 1713.85 and
2579.37 Å, respectively. BAL systems with outflow velocities
greater than this limit are not identified, and in rare cases
where a C IV outflow has extremely high velocity it could
potentially be misidentified as a Si IV system.
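
As a worked check, the four limiting wavelengths quoted above are reproduced to the two decimal places given by a logarithmic velocity scale, λ = λ_0 exp(v/c), with c rounded to 3×10^5 km s^-1. This relation is an inference from the quoted numbers, not something stated in the text, and the names below are illustrative only.

```python
import math

C_KMS = 3.0e5   # assumed: speed of light rounded to 3e5 km/s

# Rest-frame zero-velocity wavelengths quoted in the text (Angstrom).
REST = {"Si IV": 1402.77, "C IV": 1550.77, "Al III": 1862.79, "Mg II": 2803.53}

def wavelength_at_velocity(rest_wave, v_kms):
    """Wavelength at velocity v (negative for outflow) on a
    logarithmic velocity scale: lambda = lambda_0 * exp(v / c)."""
    return rest_wave * math.exp(v_kms / C_KMS)

limits = {ion: round(wavelength_at_velocity(w0, -25000.0), 2)
          for ion, w0 in REST.items()}
print(limits)
# {'Si IV': 1290.61, 'C IV': 1426.78, 'Al III': 1713.85, 'Mg II': 2579.37}
```

A logarithmic scale is a natural choice for spectra binned uniformly in log-wavelength, since a fixed number of pixels then corresponds to a fixed velocity interval.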

The mean depth, d, is the mean of the flux ratio depth 
over all pixels in the BI-defined trough. In spectra with more
than one distinct trough for a single species d is the mean 
over all troughs. Similarly, the minimum and maximum ve- 
locities listed give the minimum and maximum for all iden- 
tified absorption. 

In calculating the BI and d values, 12.0-Å regions
around 5578.5 and 6301.7 Å in the observed frame were interpolated
over to remove the effect of the prominent sky
lines at those wavelengths. Although such an interpolation 
can smooth over regions that genuinely rise above a flux ra- 
tio of 0.9, it is more common for the presence of a sky line 
within a broad absorption trough to halt the BI integration 
despite the true flux ratio still being below 0.9. The red- 
shift regions in which the sky lines fall within the C IV BI 
integration region are 2.63<z<2.91 and 3.10<z<3.42. 
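
The two redshift windows can be recovered with a short calculation. The sketch below assumes that the C IV BI integration region spans outflow velocities from 3000 to 25 000 km s^-1, that velocities map to wavelength as λ = λ_0 exp(v/c) with c = 3×10^5 km s^-1, and that the sky lines are treated as sitting at their central wavelengths; none of these choices is stated explicitly in this section.

```python
import math

C_KMS = 3.0e5            # assumed rounded speed of light (km/s)
CIV_REST = 1550.77       # C IV zero-velocity wavelength from the text (Angstrom)

def civ_rest_wave(v_kms):
    """Rest-frame wavelength of C IV absorption at velocity v (assumed log scale)."""
    return CIV_REST * math.exp(v_kms / C_KMS)

def skyline_redshift_window(sky_wave, v_far=-25000.0, v_near=-3000.0):
    """Redshift range over which an observed-frame sky line falls
    between the two edges of the C IV BI integration region."""
    z_lo = sky_wave / civ_rest_wave(v_near) - 1.0
    z_hi = sky_wave / civ_rest_wave(v_far) - 1.0
    return z_lo, z_hi

for sky in (5578.5, 6301.7):
    z_lo, z_hi = skyline_redshift_window(sky)
    print(sky, round(z_lo, 2), round(z_hi, 2))
# 5578.5 2.63 2.91
# 6301.7 3.1 3.42
```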

Table 1 includes data for both 'primary' (best) and 'du- 
plicate' observations, so many objects are listed more than 
once. To remove duplicates, leaving exactly one spectrum 
for each object in the catalogue, only those spectra with 
Primary=1 (in column 14) should be used. Duplicate observations
can be matched with their corresponding primary
observations using the SDSS object name (column 1) or coordinates
(columns 2 and 3).

Figure 5. Distributions of BI values. Thick lines in the upper two
panels show the distributions for LoBALQSOs, i.e. those quasars
that also have non-zero BI for Al III or Mg II, scaled for clarity to
have the same peak values as the overall distributions. Note that
the Al III and Mg II histograms have wider bins than the high-ionisation
species due to the smaller number of quasars included.



5.1 Distributions of absorption properties

In total, 3547 BALQSOs are identified, including 811 Si IV,
3296 C IV, 214 Al III and 215 Mg II BALQSOs. Note that
many BALQSOs show absorption in more than one species. 
In these totals, and in all following calculations, only 'pri- 
mary' SDSS spectra are used. 'Duplicate' spectra are dis- 
cussed in Section 5.2. 

The distributions of BI values for the four species examined
are shown in Fig. 5. It can be seen that the two high-ionisation
species, C IV and Si IV, follow the same BI distribution,
peaking at 2000-3000 km s^-1 in the log-space distribution,
while the low-ionisation species, Al III and Mg II,
peak below 1000 km s^-1. However, in cases where both a HiBAL
and a LoBAL trough are visible, the typical HiBAL BI
is greater than for the HiBALQSO population as a whole by
a factor of ~3, as shown by the thick lines in Fig. 5.

The extreme nature of the HiBAL absorption in LoBALQSOs
is further illustrated by Fig. 6, in which the ratio
of LoBALQSOs to the total BALQSO population is shown
as a function of Si IV or C IV BI. Only BALQSOs with complete
coverage of the Al III BAL region were included in the
calculation. The LoBALQSO population consists almost entirely
of objects with HiBAL BI values of several thousand
km s^-1 or more, and conversely almost all BALQSOs with
HiBAL BI ≳ 10 000 km s^-1 also exhibit LoBAL absorption.

Figure 6. Ratio of LoBALQSOs to the total BALQSO population,
as a function of Si IV (top panel) or C IV (bottom panel)
BI.

The differences between high- and low-ionisation ab- 
sorption properties can also be seen in the distributions of 
d values, shown in Fig. 7: the high-ionisation species peak 
at d~0.6, while the low-ionisation species peak at ~0.3. As 
expected from the BI values, the typical depth of a HiBAL 
trough in a LoBALQSO is greater than in the general Hi- 
BALQSO population. However, Fig. 8 shows there is a large 
population of BALQSOs with very deep HiBAL troughs but 
no LoBAL absorption. 



5.2 Repeat observations 

A number of quasars were observed more than once in the
SDSS spectroscopic survey. Such quasars provide a convenient
test of the reliability of the NMF reconstructions and
the resulting balnicity measurements, as the actual balnicities
of most objects are not expected to vary greatly over
the time-scales of the survey (Lundgren et al. 2007; Gibson
et al. 2008).

Figure 7. As Fig. 5, but for mean depth values. The dashed
line marks the minimum mean depth of 0.1 allowed by the BI
definition (equation 1).

Figure 8. As Fig. 6, but for mean depth values. The dashed
line marks the minimum mean depth of 0.1 allowed by the BI
definition (equation 1).

Counting only those observations for which the C IV 
Inc and C IV BBP flags are not set, there are 2051 quasars 
with between two and seven observations in DR6. Of the 
3288 pairs of observations available, there are 300 for which 
both NMF reconstructions give a C IV BALQSO classifica- 
tion, and a further 73 for which one reconstruction does but 
not the other. The distribution of differences between the 
pairs of C IV BI values, ΔBI, is shown in Fig. 9. For the
majority of cases the discrepancy is less than 1000 km s^-1.
Visual inspection confirms that, as expected, most of the
pairs with larger ΔBI consist of observations with significantly
differing S/Ns. As detailed in Section 7, the measurement
of BALQSO properties is very sensitive to S/N, even
in the ideal situation in which the NMF reconstructions do 
not vary. 

The presence of pairs in which one observation is classi- 
fied as a BALQSO, but the other is not, is confirmation that 
not all BAL troughs are identified by the NMF procedure. 
The resulting incompleteness is addressed in Section 7.

Figure 9. Solid line: distribution of differences in C IV BI values
between different observations of the same quasars. Dotted line:
distribution of C IV BI values for which the paired observation has
BI = 0.

Figure 10. Observed BALQSO fractions as a function of redshift.
As in Figs. 5 and 7 different bin widths have been used for
different species. Binomial errors are shown. The different redshift
ranges covered reflect the redshifts at which each emission line is
observable in the SDSS spectra.



5.3 Observed BALQSO fractions 

The observed BALQSO fractions for Si IV, C IV, Al III and
Mg II as a function of redshift are shown in Fig. 10. Only
spectra for which the relevant Inc and BBP flags were not
set were included when calculating these fractions. There
is a strong apparent redshift dependence of the BALQSO
fractions, the implications of which are addressed in Sections 7-9.
The highest-redshift BALQSO identified is a
C IV BALQSO at redshift z = 5.0314, but for z > 5 no C IV
BALQSO fraction is measured because all spectra have incomplete
coverage of the C IV absorption region. Of the 47
spectra at z > 5 for which complete coverage of the Si IV
absorption region is available, none are identified as Si IV
BALQSOs. Summing over all redshifts, the BALQSO fractions
for Si IV, C IV, Al III and Mg II are 3.4±0.1, 8.0±0.1,
0.38±0.03 and 0.29±0.02 per cent, respectively. The uncertainties
quoted are 1-σ binomial errors.

Figure 12. Comparison of C IV BI values presented in this paper
and in Gibson et al. (2009). Left panel: NMF-derived BI values as 
presented in Section 5. Right panel: NMF-derived BI values after 
boxcar smoothing the observed flux. 



Figure 11. As Fig. 10, but with the BALQSO fractions as a 
function of luminosity. 



The observed fractions as a function of λL_1700 are shown
in Fig. 11. Only objects with redshifts 1.2<z<4.6, i.e. where
1700 Å rest-frame is present in the SDSS spectra, are included.
The luminosities are not corrected for dust extinction
at this point. The distributions for Si IV, C IV and Al III
all show a strong tendency towards higher BALQSO fractions
at high luminosity, but this result is strongly biased by
the relative probabilities of detecting a given BAL trough
at different S/Ns; this bias is addressed in Section 7. The
fraction of Mg II BALQSOs shows a marked increase at
λL_1700 < 10^45 erg s^-1 that is not observed in the other
species; the BALQSOs contributing at these luminosities are
all very heavily dust-reddened, resulting in a very low flux
at 1700 Å.



6 COMPARISON TO PREVIOUS RESULTS 

G09 presented a catalogue of BALQSOs in the SDSS DR5, 
in which the unabsorbed emission was estimated by fitting 
a dust-reddened power law with Voigt profiles for the strong 
emission lines. The DR5 spectra are a large subsample of 
the DR6 spectra used here and a comparison of the results 
is thus possible. 

The left panel of Fig. 12 compares the C IV BI values 
presented in this paper to those in G09, including only those 
spectra that were available in DR5. It is immediately clear 
that in general there is good agreement between the mea- 
surements. However, there are some notable differences. In 
particular, there are many more objects for which the G09 
BI value is several thousand km s^-1 higher than the NMF-derived
BI value than there are for which the reverse is true;
this can be seen in the asymmetry in the distribution of
points around the line of equality. Similarly, there are 1206
objects that are identified as C IV BALQSOs in G09 but
not in this paper, but only 149 that are identified as C IV 
BALQSOs here but not in G09. 

Figure 13. As Fig. 12, but comparing the NMF results to those
from Trump et al. (2006).

Before calculating the BI values, G09 smoothed the observed
flux of each object using a boxcar with a width of
three pixels. This smoothing was performed to reduce the
number of cases where the noise in a single pixel breaks up a
BAL trough. When the same smoothing is applied before recalculating
the NMF-derived BI values, the results from the
two catalogues follow each other more closely, as shown in 
the right panel of Fig. 12. The results presented in Section 5 
use the un-smoothed spectra; the resulting incompleteness 
is corrected for in the following Sections. Visual inspection 
confirms that the spectra for which the BAL status is dif- 
ferent in different catalogues tend to be marginal cases for 
which it is debatable whether any absorption present is deep 
or broad enough to be classified as a 'broad absorber'. 
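
The effect of the three-pixel boxcar can be illustrated with a toy flux ratio in which a single noisy pixel rises above 0.9 in the middle of an otherwise continuous trough. The sketch is not taken from either catalogue's code: the function names, the truncation of the window at the array edges and the toy profile are assumptions made for illustration.

```python
def boxcar_smooth(flux, width=3):
    """Running mean; the window is truncated at the array edges (an assumption)."""
    half = width // 2
    n = len(flux)
    out = []
    for i in range(n):
        window = flux[max(0, i - half):min(n, i + half + 1)]
        out.append(sum(window) / len(window))
    return out

def longest_run_below(ratio, threshold=0.9):
    """Length in pixels of the longest contiguous stretch below the threshold."""
    best = run = 0
    for r in ratio:
        run = run + 1 if r < threshold else 0
        best = max(best, run)
    return best

# Toy flux ratio (hypothetical): a 21-pixel trough with one noise spike.
ratio = [1.0] * 5 + [0.6] * 10 + [0.95] + [0.6] * 10 + [1.0] * 5

print(longest_run_below(ratio))                  # 10
print(longest_run_below(boxcar_smooth(ratio)))   # 23
```

Without smoothing the spike splits the trough into two 10-pixel halves, either of which may be too narrow to contribute to the BI. With smoothing the trough is continuous, and is also slightly widened at its edges, which is consistent with smoothed spectra tending to yield more non-zero BI values.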

For the strongest absorbers (BI > 10 000 km s^-1) Fig. 12
shows a trend for the NMF-derived BI values to be greater 
than those from G09. The reason for the discrepancy is not 
always clear but the trend implies that, for high-BI quasars, 
the NMF algorithm tends to place the reconstructed contin- 
uum higher than G09 reconstructions. 

Similar trends can be seen when comparing to the DR3 
catalogue presented by Trump et al. (2006), as shown in 
Fig. 13, although the initial asymmetry is less pronounced 
and there is a tendency for the smoothed NMF-derived BI 
values to be higher than the values from Trump et al. (2006) 
across the entire range, again implying the NMF reconstruc- 
tions have a higher continuum level. 

The varying classifications between these three cata- 
logues, each of which used the same metric on essentially 
the same data, emphasize the importance of an accurate fit
to the unabsorbed continuum. Relatively small errors in the
position of the continuum can often change a classification
between non-BAL and BALQSO, or change the measured BI 
by thousands of km s^-1. We expect that the NMF-derived
fits presented in this paper will be consistently more accu- 
rate than the fits used to create previous catalogues, due 
to the manner in which the NMF procedure combines data 
from the entire observed spectrum to reconstruct the con- 
tinuum. In cases in which large portions of the spectrum are 
absorbed, such a method has clear advantages over one that 
attempts to directly fit line profiles in the absorbed regions. 
Visual inspection of reconstructions of synthetic BALQSO 
spectra, such as those described in Section 7, suggests that 
the NMF procedure reliably produces accurate reconstruc- 
tions of quasar continua, even in cases where large regions 
are affected by absorption. 

Recently Scaringi et al. (2009) have presented a 
BALQSO catalogue in which 'learning vector quantisation' 
(LVQ), a machine-learning algorithm, was used to classify
quasars as BALQSOs or non-BAL quasars. These classifi- 
cations were compared to those from G09 and, in the cases 
where the two disagreed, a visual inspection was performed 
to provide a final hybrid-LVQ classification. 

The final hybrid-LVQ catalogue leaves out very few of 
the DR5 quasars that are classified as BALQSOs in this 
paper. Conversely, nearly 1400 quasars have BI = 0 km s^-1
when measured from the NMF results but are classified as
BALQSOs by Scaringi et al. (2009). In part this discrepancy
is due to the increased probability of the G09 fits producing
a non-zero BI relative to the NMF results, but the decision
by Scaringi et al. (2009) to include absorbers with velocity
<3000 km s^-1 or velocity width <2000 km s^-1 if a visual
inspection indicates they have similar properties to other
BALQSOs results in the inclusion of many objects that by
definition will always be excluded from a purely BI-based
catalogue. The resulting discrepancy serves to underline the
current difficulties in objectively defining 'broad absorption'.

The public release of the C IV-region NMF-continuum
fits, for each of the 48 146 spectra in the sample with com- 
plete or partial coverage of the C IV region, allows anyone 
to use them to define BALQSO samples using any metric 
they wish. Additionally, future continuum fits to the same 
objects can be compared to the NMF-derived fits to aid in 
identification and explanation of discrepancies.

Figure 14. Redshift and luminosity bins for which the BALQSO
detection fraction was calculated. Grey points show the redshift
and luminosity of all quasars in DR6 with neither the C IV Inc
nor the C IV BBP flags set; dashed lines show the boundaries of
the bins.



7 C IV BALQSO DETECTION EFFICIENCY

The catalogue presented in Section 5 is not complete. A
number of BALQSOs will exist in the SDSS DR6 spectroscopic
survey but have a measured BI of zero. There are
two principal reasons why this can occur. Firstly, the reconstruction
of the emission may place the continuum level too
low, reducing the apparent depth of a trough or eliminating
it completely. Secondly, even for a perfect reconstruction,
a noise spike in the observed spectrum may lie above 90
per cent of the unabsorbed emission, halting the BI integration.
This incompleteness in identifying BALQSOs from
observed spectra has previously been noted (Knigge et al.
2008; G09) but not quantified. Because the level of incompleteness
is strongly dependent on the S/N of the observed
spectra, an extensive analysis is required in order to quantify
the BALQSO fraction within the SDSS, f_SDSS, as a function
of any parameters that are correlated with spectrum S/N,
such as luminosity or redshift.

It is also possible for non-BAL quasars to be incorrectly
identified as BALQSOs due to a reconstruction that places
the continuum too high. However, following the results of the
comparisons in Section 6, such false positives are expected
to be rare in the catalogue presented here.

The probability of a particular C IV BALQSO being
detected by the NMF-based routine, p_det, was estimated
by inserting BAL troughs with known properties into non-BAL
quasar spectra and processing the resulting synthetic
BALQSOs in the same way as for observed spectra. The
value of p_det for any selected set of quasar and BAL properties
is then simply the number of correctly identified BALQSOs,
divided by the total number of synthetic spectra created.
As p_det was expected to be a strong function of redshift
and S/N it was calculated separately for each of the
redshift-luminosity bins shown in Fig. 14.

The methods described below could be applied to BAL
systems in any species, but in this work we examine only
C IV systems, by far the most common form of BALQSO.



7.1 Synthetic BALQSO spectra 

In each redshift bin a sample of 50 quasars was chosen such 
that each quasar had 9 ≤ SN_R < 25 and no significant absorption
in the C IV trough region. For the 4.0 ≤ z < 4.5 redshift
bin the minimum SN_R was reduced to 7 to ensure sufficient
spectra were available. To create spectra with the same properties
but over a range of S/N, the observed quasar spectra
were degraded by adding in sky spectra from the same SDSS
plate. An example of this degradation is shown in Fig. 15,
where a quasar spectrum has 1, 3, 7 and 15 sky spectra
added to reduce the S/Ns by nominal factors of √2, √4,
√8 and √16, although the actual S/N was always recalculated
for each degraded spectrum.

Figure 15. Top panel: Flux spectrum of SDSS
J133533.69+032610.8. Other panels: the same spectrum after
addition of 1, 3, 7 and 15 sky spectra.

Figure 16. Flux ratios used to generate synthetic BALQSOs
to test the BALQSO identification completeness. Each profile is
offset for clarity. The short-dashed lines mark the zero-velocity
wavelengths of Si IV and C IV, and the long-dashed and dot-dashed
lines mark the wavelengths at v = −3000 and −25 000 km s^-1,
respectively. BI values for Si IV/C IV are, from top to bottom,
0/613, 272/1548, 1246/4231, 4434/9367 and 152/1327 (all values
in km s^-1).

The spectra were degraded by increasing factors of √2
in S/N until there were insufficient sky spectra from the
SDSS plate to continue; in most cases this halted the process
at 31 additional sky spectra, for a S/N nominally reduced
by a factor of √32 ≈ 5.66.
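
The nominal reduction factors follow from adding independent noise in quadrature. The sketch below assumes that each added sky spectrum contributes noise of the same variance as the original quasar spectrum and leaves the signal unchanged, so that n added spectra raise the noise by √(n+1); this is why the factors are only nominal and the actual S/N was recalculated. The function name is illustrative.

```python
import math

def nominal_sn_factor(n_sky):
    """Nominal factor by which the S/N drops after adding n_sky sky spectra,
    assuming independent noise of equal variance in every spectrum."""
    return math.sqrt(n_sky + 1)

factors = {n: round(nominal_sn_factor(n), 2) for n in (1, 3, 7, 15, 31)}
print(factors)   # {1: 1.41, 3: 2.0, 7: 2.83, 15: 4.0, 31: 5.66}
```

Each step in the sequence 1, 3, 7, 15, 31 doubles the total number of noise contributions, which gives the successive factors of √2 in S/N.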

The BAL troughs used to create synthetic BALQSOs 
were created from the flux ratio profiles of a set of five ac- 
tual BALQSOs, chosen to provide a range of BAL velocity 
widths. The initial flux ratios were generated by dividing the 
observed spectrum by the NMF reconstruction, and smooth- 
ing the result by a boxcar window with width 15 pixels. The 
maximum flux ratio was set to unity, and any absorption 
outside the principal C IV and Si IV troughs was erased. The 
resulting smoothed ratios are shown in Fig. 16. 

Preliminary tests suggested that the fraction of BAL 
troughs recovered depended more strongly on the mean 
depth, d, than on the BI itself, so each of the flux ratio profiles
in Fig. 16 was scaled to create profiles with C IV d = 0.15,
0.2, 0.25, 0.3, 0.4, 0.5, 0.7 and 0.9. A simple linear scaling 
was used. Fig. 17 shows the resulting troughs at different 
values of d for one of the input flux ratio profiles. 
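
One way to realise such a linear scaling is sketched below. It is an interpretation, not the authors' code: the absorption depth (one minus the flux ratio) is assumed to be multiplied by a single constant chosen so that the mean depth over a fixed set of trough pixels equals the target, whereas in the catalogue d is defined over the BI-defined trough. The function name and toy profile are hypothetical.

```python
def scale_to_mean_depth(ratio, target_depth, trough):
    """Linearly rescale the absorption depth of a flux ratio profile.

    ratio        -- flux ratio per pixel (1.0 = unabsorbed)
    target_depth -- desired mean depth over the trough pixels
    trough       -- indices of the pixels that define the trough (held fixed)
    """
    depths = [1.0 - r for r in ratio]
    current = sum(depths[i] for i in trough) / len(trough)
    k = target_depth / current
    # Clip so that the flux ratio can never become negative; if clipping
    # occurs the achieved mean depth falls short of the target.
    return [1.0 - min(1.0, k * d) for d in depths]

# Toy profile (hypothetical) with a mean trough depth of about 0.27.
ratio = [1.0, 0.8, 0.6, 0.8, 1.0]
trough = [1, 2, 3]
scaled = scale_to_mean_depth(ratio, 0.4, trough)
mean_depth = sum(1.0 - scaled[i] for i in trough) / len(trough)
print(round(mean_depth, 3))   # 0.4
```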

To insert the troughs into non-BAL quasar spectra, sim- 
ply multiplying the reference spectra by the selected flux ra- 
tio profiles would have artificially reduced the noise within 
the troughs. Instead the original NMF reconstructions of the 
non-BAL quasar spectra were multiplied by the flux ratios, 
and the resulting trough shapes - the difference between the 
reconstruction and the reconstruction multiplied by the flux 
ratio - were subtracted from the observed spectra, as shown 
in Fig. 18. 
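
The insertion step can be written compactly. In this sketch the function name and the toy arrays are illustrative; the operation itself, subtracting the difference between the reconstruction and the reconstruction multiplied by the flux ratio, is the one described above.

```python
def insert_trough(observed, recon, ratio):
    """Subtract the trough shape, recon - recon * ratio, from the observed flux."""
    return [o - (c - c * r) for o, c, r in zip(observed, recon, ratio)]

# Toy data (hypothetical): flat reconstruction plus noise, V-shaped trough.
recon = [10.0] * 5
observed = [10.5, 9.7, 10.2, 9.9, 10.1]
ratio = [1.0, 0.5, 0.2, 0.5, 1.0]

synthetic = insert_trough(observed, recon, ratio)

# The noise about the absorbed continuum is unchanged by the insertion.
noise_before = [o - c for o, c in zip(observed, recon)]
noise_after = [s - c * r for s, c, r in zip(synthetic, recon, ratio)]
print(all(abs(a - b) < 1e-12 for a, b in zip(noise_before, noise_after)))  # True
```

Multiplying the observed spectrum by the ratio would instead have scaled the noise in the deepest pixel by 0.2, which is the artificial noise reduction the subtraction is designed to avoid.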

By inserting each synthetic BAL profile at each mean 
depth into each of the spectra in the test sample for all levels 
of S/N degradation, the 50 input spectra in each redshift bin 
were expanded to over 10 000, covering extended ranges of 
S/N and BAL properties. NMF reconstructions of these syn- 
thetic BALQSO spectra were calculated by the method de- 
scribed in Section 4 to determine if the BAL troughs would 
be detected.

Figure 17. Final flux ratio profiles with C IV d values as shown,
generated from the 4th profile from the top in Fig. 16. Each
profile is offset for clarity. The short-dashed lines mark the
zero-velocity wavelengths of Si IV and C IV, and the long-dashed
and dot-dashed lines mark the wavelengths at v = −3000 and
−25 000 km s^-1, respectively. BI values for Si IV/C IV are, from top
to bottom, 142/800, 555/1847, 1020/2820, 1475/3747, 2469/5625,
3450/7501, 5655/11198 and 8775/14593 (all values in km s^-1).



7.2 Relating luminosity to signal-to-noise ratio

Figure 18. Method of inserting a BALQSO trough into a non-BAL
quasar spectrum. Top panel: Flux spectrum (black) of
SDSS J122626.22+270437.0 with the initial NMF reconstruction
(dashed red) and the reconstruction multiplied by the chosen flux
ratio spectrum (solid red). Middle panel: difference between the
initial reconstruction and the reconstruction multiplied by the
flux ratio spectrum. Bottom panel: quasar spectrum after subtracting
the reconstruction difference spectrum shown in the middle
panel.

The variation in p_det with S/N is crucial because S/N correlates
with important physical parameters such as redshift
and luminosity. The synthetic BALQSO spectra have known
redshifts, from the original non-BAL quasars on which they
are based, but the luminosities of the degraded spectra are
not well-defined. To relate the S/N to luminosity, the SN_1700
and λL_1700 values for all quasars with SN_1700 < 25 were extracted
from the catalogue. In each redshift range a linear
regression fit was made to empirically determine the relationship
between S/N and luminosity. The values used to
define the luminosity bins were then converted to S/N values
using these linear relationships, allowing each synthetic
spectrum to be assigned to a redshift-luminosity bin.
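
The conversion of luminosity bin edges to S/N thresholds might look like the sketch below. Several choices here are assumptions: the regression is taken to be of log10(SN_1700) on log10(λL_1700), which the text does not specify, the catalogue values are invented for illustration, and the bin edges 10^45.9, 10^46.1 and 10^46.3 erg s^-1 are those given in the caption of Fig. 19.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical catalogue values for one redshift range (not real data).
log_lum = [45.5, 45.9, 46.1, 46.3, 46.7]
log_sn = [0.8 * (x - 45.9) + 0.5 for x in log_lum]

slope, intercept = linear_fit(log_lum, log_sn)

# Convert the luminosity bin edges into equivalent S/N thresholds.
lum_edges = [45.9, 46.1, 46.3]
sn_edges = [10 ** (intercept + slope * x) for x in lum_edges]

def luminosity_bin(sn, edges):
    """Index of the luminosity bin (0 = faintest) implied by a measured S/N."""
    return sum(sn >= e for e in edges)

print([round(e, 2) for e in sn_edges])
print(luminosity_bin(1.0, sn_edges), luminosity_bin(100.0, sn_edges))
```

Each degraded synthetic spectrum, whose S/N is measured but whose luminosity is not well-defined, could then be assigned to a bin with `luminosity_bin`.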



7.3 Detection probabilities 

The derived values of p_det as a function of input mean depth,
d_in, are shown in Fig. 19 for all redshift and luminosity bins.
The uncertainties were calculated from bootstrap realisa- 
tions, in each of which 50 spectra were chosen at random 
with replacement from the 50 input spectra. 
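
A minimal sketch of the detection fraction and its bootstrap uncertainty is given below. It assumes that the resampling is over the input quasar spectra, with all synthetic spectra derived from one quasar kept together, and that the quoted uncertainty is the standard deviation over realisations; the function names, the number of realisations and the toy detection flags are hypothetical.

```python
import random

def p_det(detected_by_spectrum):
    """Fraction of synthetic BALQSOs recovered.

    detected_by_spectrum -- one list of booleans per input quasar spectrum,
                            one boolean per synthetic spectrum built from it.
    """
    flags = [f for spec in detected_by_spectrum for f in spec]
    return sum(flags) / len(flags)

def bootstrap_p_det(detected_by_spectrum, n_boot=1000, seed=0):
    """Mean and standard deviation of p_det over bootstrap realisations,
    each drawing len(input) spectra at random with replacement."""
    rng = random.Random(seed)
    n = len(detected_by_spectrum)
    samples = []
    for _ in range(n_boot):
        resample = [detected_by_spectrum[rng.randrange(n)] for _ in range(n)]
        samples.append(p_det(resample))
    mean = sum(samples) / n_boot
    var = sum((s - mean) ** 2 for s in samples) / (n_boot - 1)
    return mean, var ** 0.5

# Toy case (hypothetical): 50 input spectra, troughs recovered in 30 of them.
flags = [[True] * 4 for _ in range(30)] + [[False] * 4 for _ in range(20)]
print(p_det(flags))                    # 0.6
mean, sigma = bootstrap_p_det(flags)
```

Resampling whole input spectra, rather than individual synthetic spectra, reflects the fact that the synthetic spectra built from one quasar are not independent of each other.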

As expected, the detection probabilities are highest for
quasars with high luminosity (and hence high S/N) and deep
BAL troughs. In general, for λL_1700 ≥ 10^45.9 erg s^-1, there is
a value of d_in above which all or nearly all BAL troughs are
detected; below this depth p_det falls rapidly. In the redshift
range 4.0 ≤ z < 4.5 (bottom-right panel in Fig. 19) none of
the synthetic spectra reached the low S/N required for the
lowest luminosity bin, although the results from the higher-luminosity
bins suggest p_det would be very low. With only
four quasars observed in this bin, we exclude the bin in the
following analysis.

The detection probabilities calculated here were based 
on the actual mean depth of the BAL trough, but an exami- 
nation of the results from the synthetic BALQSOs suggested 
that on average the observed mean depth, d_obs, was slightly
deeper than the input mean depth, d_in. The offset is due in
part to (i) a tendency for the shallow wings of the input absorption
profiles to be excluded from the observed BI regions
and (ii) the possibility for an observed trough, scattered to
smaller depths, to be mis-identified as a non-BAL quasar.
To correct for this trend, in each redshift-luminosity bin a
linear regression line was fitted to d_in as a function of d_obs,
and the observed mean depths presented in Section 5 were
reduced according to the regression line before f_SDSS was
calculated. The typical corrections are small, with a median
Δd of only 0.03. The corrected depth, d_cor = d_obs − Δd, is
taken to be equivalent to the input depth, d_in, of the synthetic
quasars.
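
The depth correction amounts to evaluating the fitted regression line at each observed depth. In the sketch below the input depths are the eight values used for the synthetic troughs, but the observed depths are a toy construction, offset by a uniform 0.03 purely for illustration, and the function names are hypothetical.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

# Input depths of the synthetic troughs, and toy "observed" depths
# that are uniformly 0.03 deeper (hypothetical, for illustration only).
d_in = [0.15, 0.2, 0.25, 0.3, 0.4, 0.5, 0.7, 0.9]
d_obs = [d + 0.03 for d in d_in]

# Fit d_in as a function of d_obs within one redshift-luminosity bin.
slope, intercept = linear_fit(d_obs, d_in)

def corrected_depth(d_observed):
    """d_cor = d_obs - delta_d, read off the regression line."""
    return intercept + slope * d_observed

delta_d = 0.5 - corrected_depth(0.5)
print(round(delta_d, 3))   # 0.03
```

The corrected depth is then the quantity at which p_det would be looked up for each catalogued BALQSO.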



8 DIFFERENTIAL SDSS TARGET SELECTION 

Any BALQSO fraction derived directly from the quasar 
spectra in the SDSS will not be the intrinsic BALQSO frac- 
tion, because the SDSS spectroscopic survey is not 100 per 
cent complete. Quasar candidates were chosen for spectro- 
scopic observations based on their photometric properties, as 
well as specifically targeting sources identified in the FIRST 
radio catalogues (Becker et al. 1995), giving an overall completeness,
for quasars brighter than an observed i-band magnitude,
of >90 per cent (Richards et al. 2002). Crucially, the
presence of material causing a BAL trough along the line-of- 
sight to a quasar changes the quasar's observed photometric 
properties. As a result, the probability of a BALQSO be- 
ing selected by the target selection algorithm is different to 
that for a non-BAL quasar whose properties are otherwise 
the same. Differential selection probabilities were calculated 
by Reichard et al. (2003b), using simulated BALQSO and 
non-BAL quasar magnitudes; we follow a similar but more 
involved procedure here. 

First, model quasar spectra are generated covering a 
wide range of expected quasar properties (Section 8.1). As 
well as different BAL properties we simulate different lev- 
els of dust reddening, a range of Lyman limit cut-off wave- 
lengths and different continuum power laws. Each of these 
factors has an effect on the photometric properties of quasars 
and hence can change the SDSS completeness. In Section 8.2 
we calculate observed ugriz magnitudes from the model 
spectra at a range of redshifts and luminosities, allowing 
us to process them with the SDSS quasar target selection 
algorithm to determine the probability each model quasar 
would be targeted. The resulting completeness values are 
presented as contour plots in redshift-magnitude space in 
Section 8.3. 

In order to make use of the completeness values derived 
for each individual model quasar, they must be weighted ac- 
cording to the expected probability distribution functions of 
each input parameter. In Section 8.4 we describe our meth- 
ods for determining such distributions. Finally, in Section 8.6 
we combine the weighted completeness values with an input 
luminosity distribution to derive the overall completeness
for BALQSOs and non-BAL quasars as a function of redshift
and luminosity.

Figure 19. Probability of the NMF routine detecting a BALQSO in SDSS as a function of input d. Each panel represents a single redshift
bin, as labelled. The curves within each panel represent different luminosities: $\lambda L_{1700} < 10^{45.9}$ (solid blue), $10^{45.9} \le \lambda L_{1700} < 10^{46.1}$
(dotted red), $10^{46.1} \le \lambda L_{1700} < 10^{46.3}$ (dashed green) and $10^{46.3} \le \lambda L_{1700}$ (dot-dashed black) (all luminosities in erg s$^{-1}$).



8.1 Model quasar spectra 

The wavelength coverage required to synthesise SDSS ugriz 
photometry is greater than that provided by the SDSS spec- 
tra, so more extended model spectra must be used. The ref- 
erence quasar model for the investigation is that described 
in Maddox et al. (2008). The Maddox et al. model quasar 
SED reproduces the variations of the median colours of the 
SDSS quasars over the full redshift range 0.2<z<5.0 to a 
high degree of accuracy. A graphical indication of the suc- 
cess of an earlier version of the model in reproducing the 
colours of SDSS quasars can be found as fig. 8 of Chiu et al. 
(2007). The implementation employed here incorporates a 
refined parametrization of the Lyα forest opacity based on
the work of Faucher-Giguère et al. (2008). The model results
in an excellent match to the observed median ugr colours of 
the SDSS quasars over the redshift range 1.7<z<5.0, repre- 
senting a significant improvement over the parametrization 
based on the results of Songaila (2004) that was employed 
previously. 

The Maddox et al. (2008) model SED does reproduce
very well the locus of observed quasar colours over an
extended range of redshifts. However, it uses a fixed
continuum model while the quasar population exhibits an
intrinsic range in overall continuum shape. As only a small
fraction of quasars deviate significantly from the model SED
such quasars typically have little effect on the overall SDSS
completeness. None the less, there are regions of parameter
space in which the quasar locus passes through the
colour-volume occupied by main sequence stars and the predicted
completeness values plummet, while those quasars with
unusually blue SEDs skirt the stellar main sequence and do
satisfy the SDSS quasar selection criteria. In such regions
the 'blue' quasars do make a significant contribution to the
completeness values. To include this effect an additional
series of model quasars, with the rest-frame ultraviolet power
law slope $\alpha$ ($f_\nu \propto \nu^{-\alpha}$) bluer by 0.5,
was generated.

The reference quasar spectra were modified in the fol- 
lowing ways to reproduce the range of SEDs present among 
the quasar population. The spectrum was dust-reddened using
an empirically-derived extinction curve (Maddox et al.
2010, in preparation) with E(B-V) values of 0.0, 0.05, 0.1,
0.15 and 0.2. The Maddox et al. (2010) extinction curve is
very similar to the extinction curve of the Small Magellanic
Cloud (SMC), which is frequently used in studies of quasars.
Like the SMC curve, there is no 2175-Å feature but the
increase in extinction below 1600 Å is somewhat shallower,
although still always greater than for a Large Magellanic
Cloud extinction curve.

Changing the wavelength of the Lyman limit cut-off,
$\lambda_{\rm LL}$, can have a large effect on the observed
quasar colours, particularly by changing the u-band flux and
hence the u-g colour. Such an effect can move quasars into or
out of the stellar locus, changing their targeting status with
only a small shift in $\lambda_{\rm LL}$, a feature which
Prochaska, Worseck & O'Meara (2009) and Worseck & Prochaska
(2010) have highlighted recently. To explore the impact of a
Lyman limit cut-off we generated sets of objects following the
tracks in ugriz magnitude space defined by varying
$\lambda_{\rm LL}$ between 600 and 912 Å. The objects were
spaced along the tracks such that the step size,
$\Delta m = \sqrt{\Delta u^2 + \Delta g^2 + \Delta r^2 + \Delta i^2 + \Delta z^2}$,
was constant, with the number of steps chosen individually for
each test quasar to give $\Delta m \sim 0.1$.
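
The constant-step spacing can be sketched as follows. The sketch assumes the track is available as a finely sampled list of (u, g, r, i, z) points and approximates the step size by the path length along that polyline; `resample_track` is a hypothetical helper, not the routine used for the paper.

```python
import math
from bisect import bisect_right


def resample_track(track, target_step=0.1):
    """Return points equally spaced in path length along a ugriz track.

    track: list of at least two (u, g, r, i, z) tuples, finely sampled
    as the Lyman limit wavelength is varied (assumed input format).
    """
    # Cumulative Euclidean path length in five-band magnitude space.
    cum = [0.0]
    for a, b in zip(track, track[1:]):
        cum.append(cum[-1] + math.dist(a, b))
    total = cum[-1]

    # Number of steps chosen per track so that each step is close to 0.1.
    n_steps = max(1, round(total / target_step))

    points = []
    for k in range(n_steps + 1):
        s = total * k / n_steps
        j = min(bisect_right(cum, s) - 1, len(track) - 2)
        seg = cum[j + 1] - cum[j]
        t = 0.0 if seg == 0 else (s - cum[j]) / seg
        points.append(tuple(p + t * (q - p)
                            for p, q in zip(track[j], track[j + 1])))
    return points


# Toy track in which only the u magnitude changes, by 1 mag in total.
toy_track = [(20.0 + 0.01 * k, 19.5, 19.3, 19.1, 19.0) for k in range(101)]
toy_points = resample_track(toy_track)
```

For the toy track the total path length is 1 mag, so eleven points separated by steps of 0.1 are returned.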

BALQSOs were simulated by employing a modified subset
of the flux ratios used in Section 7. The flux ratio profiles
used were the first, third and fourth from the top in
Fig. 16, each with mean depths of 0.15, 0.3, 0.5, 0.7 and 0.9.
In each case the C IV trough profile was replicated at the
wavelengths of the Si IV, N V and Lyα lines. In the case of
the Si IV and N V lines the C IV trough was reduced to 80 per
cent of its original depth, in order to ensure consistency
between different profiles representing different BI ranges. In
addition, the flux bluewards of 1050 Å was reduced by the
mean depth of the C IV trough, to approximate the combined
effect of broad absorption from a number of other
high-ionisation species. The existence of these troughs can
be seen directly by comparing composite SDSS spectra of
BALQSOs and non-BAL quasars with redshifts 3.0<z<4.0
(such as those in Fig. 22), where the rest-frame 800-1100 Å
wavelength region is visible in the spectra, showing a
systematic depression in the BALQSOs compared to the non-BAL
quasars below ~1050 Å.

The BALQSO completeness values calculated below 
were interpolated to cover all eight of the mean depths for 
which the detection efficiency was calculated in Section 7, 
for each of the three profiles used, giving 8 × 3 = 24 BI/d
ranges.

The recipe adopted for the BAL trough simulations 
closely matches the observed properties of BAL quasars 
among the SDSS spectra. The non-BAL quasar spectra were 
also retained and the full suite of model spectra covers an 
extended range in the reddening and BAL properties of 
quasars. 



8.2 Synthetic SDSS magnitudes 

In order to measure the completeness for the model quasar 
spectra, SDSS magnitudes were generated for each and mul- 
tiple realisations were created by scattering according to 
typical photometric errors. The sets of scattered magnitudes 
were processed by the SDSS quasar target selection algo- 
rithm to determine if each would be targeted. The com- 
pleteness for each model quasar could then be estimated as 
the fraction of realisations, scattered from the true magni- 
tudes for that model, that were targeted. The procedure is 
described in detail below. 

Synthetic SDSS colours for the model quasars, over a 
range of redshifts, were determined as described in Hewett 
et al. (2006). ugriz magnitudes were then derived by setting 
the i-magnitude to a specific value and applying the syn- 
thetic colours. The magnitudes were calculated according 
to the asinh magnitude system presented by Lupton et al. 
(1999), to match the SDSS photometric data. The range
of redshifts and i-magnitudes used is shown in Fig. 20; the
values were chosen to concentrate on the regions where
completeness varies rapidly according to Richards et al. (2002).

Figure 20. Redshift-magnitude combinations tested for
completeness in the SDSS quasar target selection algorithm. The
contours show the completeness for non-BAL quasars measured by
Richards et al. (2002), marking the 10, 25, 50, 75 and 90 per cent
levels. The 90 per cent contour is marked with a dashed line. Red
crosses show the combinations tested in this work.
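
For reference, the asinh magnitude of Lupton et al. (1999) can be written as $m = -(2.5/\ln 10)\,[\sinh^{-1}((f/f_0)/2b) + \ln b]$, where $f/f_0$ is the flux in units of the zero-point flux and $b$ is a band-dependent softening parameter. The sketch below assumes the softening values quoted in the SDSS photometric documentation (recalled here, and worth checking against that documentation); the function name is hypothetical.

```python
import math

# Softening parameters b, in units of the zero-point flux f0.
# Assumed values, as quoted in the SDSS photometry documentation.
B_SOFT = {"u": 1.4e-10, "g": 0.9e-10, "r": 1.2e-10,
          "i": 1.8e-10, "z": 7.4e-10}


def asinh_mag(flux_ratio, band):
    """asinh magnitude for a flux given in units of the zero-point flux."""
    b = B_SOFT[band]
    return -2.5 / math.log(10) * (math.asinh(flux_ratio / (2 * b))
                                  + math.log(b))


# For bright sources the asinh magnitude approaches the usual
# logarithmic magnitude; at zero flux it stays finite.
bright_i = asinh_mag(10 ** (-0.4 * 18.0), "i")
zero_flux_i = asinh_mag(0.0, "i")
```

The two systems agree closely for objects well above the noise level, which is the regime relevant to the i-magnitudes tested here; the difference only matters for faint fluxes such as the u-band flux of high-redshift quasars.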

The errors in the photometric measurements can cause 
an object to scatter into or out of the target selection region. 
Typical errors were derived from the 'BEST' photometric 
values and errors in the SDSS DR5 quasar catalogue (Schnei- 
der et al. 2007). For each individual photometric measure- 
ment all the measurements in that band with a magnitude 
within 0.25 were extracted, along with their associated er- 
rors. The first and third quartile values of the extracted er- 
ror distribution were chosen to represent 'good' and 'poor' 
observing conditions, respectively. For each set of synthetic 
magnitudes ten 'good' and ten 'poor' realisations were
created by adding Gaussian random noise with the appropriate
1-σ errors.
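
A minimal sketch of this step is given below. It assumes the catalogue is available as per-band lists of (magnitude, error) pairs and that the noise is added directly to the magnitudes; the function names and data layout are hypothetical.

```python
import random
from statistics import quantiles

BANDS = "ugriz"


def quartile_errors(mag, catalogue, window=0.25):
    """First and third quartile of catalogue errors near a given magnitude.

    catalogue: list of (magnitude, error) pairs for one band (assumed
    layout). At least two entries must fall inside the window.
    """
    errors = [err for m, err in catalogue if abs(m - mag) <= window]
    q1, _median, q3 = quantiles(errors, n=4)
    return q1, q3


def scatter_realisations(mags, catalogues, n_each=10, rng=None):
    """Create n_each 'good' and n_each 'poor' noisy copies of mags."""
    rng = rng or random.Random()
    sigmas = {b: quartile_errors(mags[b], catalogues[b]) for b in BANDS}
    out = []
    for quality, label in enumerate(("good", "poor")):
        for _ in range(n_each):
            scattered = {b: rng.gauss(mags[b], sigmas[b][quality])
                         for b in BANDS}
            out.append((label, scattered))
    return out


# Toy catalogue: seven measurements per band with errors 0.01 to 0.07.
toy_cat = [(19.0 + 0.05 * k, 0.01 * (k + 1)) for k in range(7)]
toy_catalogues = {b: toy_cat for b in BANDS}
toy_mags = {b: 19.15 for b in BANDS}
toy_q1, toy_q3 = quartile_errors(19.15, toy_cat)
toy_real = scatter_realisations(toy_mags, toy_catalogues,
                                rng=random.Random(1))
```

The completeness of a model quasar would then be the fraction of its twenty realisations that the target selection algorithm accepts.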

The photometric errors of an object are used by the 
SDSS target selection algorithm to assess the significance 
of an object's deviation from the stellar locus in colour 
space. The relevant magnitude errors required are those per- 
taining to the scattered rather than intrinsic photometric 
properties. Thus, the magnitude errors were recalculated 
for each realisation using the 'TARGET' photometric data 
from Schneider et al. (2007), which better reflect the in- 
formation originally used for quasar target selection. The 
resulting sets of magnitudes and errors were processed by 
the SDSS quasar target selection algorithm, as radio-quiet 
stellar (point-source) candidates, to determine the spectro- 
scopic target status of each model quasar realisation. The 
results from each set of twenty realisations were combined 
to estimate the completeness for that model quasar. 

There are a number of criteria under which an object 
can be targeted; the PRIMTARG header item in each SDSS 
spectrum's datafile specifies whether an object was targeted 
as a high-redshift quasar candidate (HIZ), a low-redshift 
quasar candidate (LOWZ), a FIRST radio source or some 
other candidate type. Multiple targeting flags for a single
source are allowed. The target selection algorithm is
described in detail by Richards et al. (2002). Only the HIZ
and LOWZ flags are used here: these flags depend only on
the photometric properties of each individual object, and
are the criteria under which the majority of SDSS quasars
were selected.

To reduce the CPU time and data transfer requirements 
of processing the very large number of test points produced, 
after an initial set of 30 000 000 objects had been processed 
by the target selection algorithm further objects were first 
compared to the initial set. If the two nearest neighbours 
in magnitude space to a new object had the same PRIMTARG 
values then this value was adopted for the new object too. If 
they disagreed, the new object was processed by the normal 
target selection algorithm. Tests on a random subsample of 
objects suggested that comparison to the nearest neighbours 
gave the correct result in 94 per cent of cases; given the large 
number of objects included in each completeness datapoint 
this represents an acceptable error rate. 
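
The nearest-neighbour shortcut can be sketched as below. The sketch uses a brute-force search over the reference set for clarity (a spatial index would be needed for 30 000 000 objects), and `run_selection` is a hypothetical stand-in for the full target selection algorithm.

```python
import heapq
import math


def primtarg_with_shortcut(mags, reference, run_selection):
    """Return a PRIMTARG value, reusing earlier results where possible.

    mags: (u, g, r, i, z) tuple for the new object.
    reference: list of (mags_tuple, primtarg) pairs already processed.
    run_selection: callable applied only when the neighbours disagree.
    """
    nearest = heapq.nsmallest(
        2, reference, key=lambda item: math.dist(item[0], mags))
    flags = {flag for _mags, flag in nearest}
    if len(nearest) == 2 and len(flags) == 1:
        return flags.pop()          # both neighbours agree: adopt value
    return run_selection(mags)      # otherwise run the full algorithm


# Toy reference set (illustrative flag strings only).
toy_reference = [
    ((19.0, 19.0, 19.0, 19.0, 19.0), "LOWZ"),
    ((19.1, 19.1, 19.1, 19.1, 19.1), "LOWZ"),
    ((22.0, 22.0, 22.0, 22.0, 22.0), "NONE"),
]
```

The full algorithm is run only for objects whose two nearest neighbours disagree, which is where most of the computational saving arises.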

8.3 Completeness contours in redshift-magnitude space

Example completeness contour plots are shown in Fig. 21,
for the normal (not 'blue') quasar SED with a range of input
properties. The overall features for the dust-free quasars
with $\lambda_{\rm LL}$ = 912 Å (top-left panel for a non-BAL
quasar, top-right for a BALQSO) are largely as expected from the
results of Richards et al. (2002) and Reichard et al. (2003b):
the completeness is very high for both non-BAL quasars and
BALQSOs with z < 2.1 and i < 19.1, drops sharply for a
narrow region close to z~2.6, and rises again for higher
redshifts.

Differences between the non-BAL and BALQSO completeness
values can be seen around z~2.6, where the model
quasar colours move close to or through the stellar locus.
The redder u-g colours for BALQSOs, caused by suppression
of the flux bluewards of 1050 Å, cause BALQSOs to enter
and then leave the stellar locus at lower redshifts than
non-BAL quasars. A further difference is seen at z~3.5, where
the BALQSO colours again come close to the stellar locus,
causing a sharp but brief drop in the completeness. At other
redshifts, in particular z<2.1 and z>3.6, the completeness
is very similar for BALQSOs and non-BAL quasars.

Adjusting the position of the Lyman limit cut-off wavelength
can have a very strong effect on the derived completeness
values, due to the resulting change in the u-magnitude
of the quasars. The bottom-left panel of Fig. 21 shows an
extreme case in which a very low $\lambda_{\rm LL}$ = 600 Å
allows significant u-band flux at all redshifts, keeping the
quasar colours away from the stellar locus but also preventing
selection based on the HIZ targeting.

The contour plots in Fig. 21 are based on the observed
i-band magnitude, so the bottom-right panel, with
E(B-V) = 0.1, shows the effects of dust reddening but not dust
extinction. However, there are still large differences between
the results for dust-free and dust-reddened quasars due just
to the induced change in colours. Reichard et al. (2003b)
showed that, depending on redshift, the effect of dust can
be to move the quasar away from or towards the stellar
locus in colour space. Such motion makes target selection
more or less likely at different redshifts. The tendency for
BALQSOs to exhibit more dust reddening than non-BAL
quasars (e.g. Sprayberry & Foltz 1992; Reichard et al. 2003b)
is thus important when considering relative completeness
values for the SDSS quasar selection.

Combining the different quasar parameters has effects 
on the completeness maps that are not obvious from exam- 
ining the parameters individually. The plots in Fig. 21 only 
illustrate the types of effect that arise. The impact of com- 
binations of dust-reddening, Lyman limit cut-off wavelength 
and BAL properties must be considered in order to quantify 
the relative completeness. 

8.4 Distributions of quasar properties 

In order to apply the completeness maps derived above, esti- 
mates must be made of the distributions of the quasar prop- 
erties on which they are based. Such distributions allow us 
to weight the results from different input parameters accord- 
ingly. There are five properties for which quasar population 
distributions are required: (i) the intrinsic luminosity distri- 
bution of quasars, (ii) the intervening absorber Lyman-limit 
wavelengths, (iii) the actual BAL trough properties, (iv) the 
relative numbers of quasars with 'normal' and 'blue' SEDs 
and (v) dust extinction and reddening, i.e. E(B-V). The
first four of these are covered here, while the E(B-V)
distributions are described in Section 8.5.

The intrinsic luminosity distribution was modelled as a
power law,
$\Phi(\log(\lambda L_{1700})) \propto 10^{-\alpha \log(\lambda L_{1700})}$,
with $\alpha$ = 2.2. The value of $\alpha$ was determined from a
fit to the observed distribution of quasars with z < 2.0. Within
each redshift bin the number density per unit redshift was
assumed to be constant. As the completeness correction is applied
to separate redshift bins independently the change in number
density between redshift bins does not enter into the calculations.
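
As a small worked example of the power-law distribution, the relative number of quasars in a logarithmic luminosity interval follows from integrating $10^{-\alpha x}$ over $x = \log(\lambda L_{1700})$. The function name and the reference luminosity below are hypothetical; only the slope of 2.2 comes from the text.

```python
import math


def relative_number(log_lo, log_hi, alpha=2.2, log_ref=45.9):
    """Integral of 10**(-alpha * (x - log_ref)) for x in [log_lo, log_hi].

    log_ref only sets an arbitrary normalisation (and avoids very small
    floating-point numbers); it cancels in any ratio of two intervals.
    """
    k = alpha * math.log(10)
    return (math.exp(-k * (log_lo - log_ref))
            - math.exp(-k * (log_hi - log_ref))) / k


# Ratio of the numbers in two adjacent 0.2-dex luminosity intervals.
ratio = relative_number(46.1, 46.3) / relative_number(45.9, 46.1)
```

For two adjacent intervals of equal width $w$ the ratio is $10^{-\alpha w}$, so with $\alpha$ = 2.2 each 0.2-dex step in luminosity reduces the number of quasars by a factor of about 2.75.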

The following results do not depend strongly on the details
of the luminosity distribution used. The effect of using
different values of $\alpha$ is to change slightly the overall
intrinsic BALQSO fraction while making little or no difference to
the shape of the fraction as a function of redshift; the effects
are discussed further in Section 9.2. Tests using a
redshift-dependent luminosity distribution determined from the
observed quasars in each redshift bin produced very similar
results. The stability in the determination of the intrinsic
BALQSO fraction, $f_{\rm int}$ (Section 9), is in part because the
final result depends only on the ratio of the non-BAL quasar
and BALQSO completenesses, rather than their absolute
values.

In order to weight the contributions from each of the
Lyman limit wavelengths an initial weight of 50 per cent
was assigned to $\lambda_{\rm LL}$ = 912 Å, to account for quasars
where a Lyman limit system exists within the host galaxy. This
fraction is consistent with a visual inspection of SDSS quasar
spectra. To assign the weights for the remaining half of the
quasars, a set of $10^6$ quasar sightlines was populated with
Lyman limit systems and damped Lyα absorbers using the
method of Warren, Hewett & Osmer (1994) and parameters
of Fan (1999). The completeness values derived for each of
the discrete u-magnitudes tested were interpolated to
intervening u-magnitudes, corresponding to different
$\lambda_{\rm LL}$, and weighted according to the fraction of the
$10^6$ quasar sightlines whose highest-redshift absorber produced
that value of $\lambda_{\rm LL}$.

Figure 21. Example contour plots of the completeness derived for synthetic quasar spectra. Contours correspond to completeness levels of
10, 25, 50, 75 and 90 per cent. Top-left: non-BAL quasar, no dust-reddening, $\lambda_{\rm LL}$ = 912 Å. Top-right: BALQSO with BI = 4124 km s$^{-1}$,
no dust-reddening, $\lambda_{\rm LL}$ = 912 Å. Bottom-left: non-BAL quasar, no dust-reddening, $\lambda_{\rm LL}$ = 600 Å. Bottom-right: non-BAL quasar,
E(B-V) = 0.1, $\lambda_{\rm LL}$ = 912 Å.



The SDSS selection effects were calculated and applied 
separately for non-BAL quasars and each of the 24 BALQ- 
SOs described in Section 8.1. The 'normal' and 'blue' quasar 
SEDs in the population were assigned weights of 0.9 and 0.1, 
respectively. 



8.5 E(B - V) distributions 

The E(B-V) probability density function for the non-BAL
quasars was taken from Maddox et al. (2010, in preparation).
Maddox et al. matched the SDSS quasar catalogue to
UKIRT Infrared Deep Sky Survey (UKIDSS; Lawrence et al.
2007) YJHK near-infrared photometry. E(B-V) estimates
were then made for each quasar using the observed SDSS
i-band to near-infrared colours, relative to the colours for the
unreddened model quasar SED employed in this paper. The
method utilises quasar rest-frame wavelengths in the optical,
where the Milky Way, LMC and SMC extinction curves show
very similar behaviour. Maddox et al. (2010) take careful
account of a number of selection effects involved in combining
the SDSS and UKIDSS data but their resulting E(B-V)
probability density functions parametrize the observed
distribution of extinction values for quasars satisfying the SDSS
LOWZ and HIZ quasar selection criteria.

The observed E(B-V) distribution is biased towards
low values as, in general, the presence of dust makes a quasar
less likely to be included in the SDSS spectroscopic survey.
To recover the intrinsic distribution of quasar properties,
an estimate of the overall completeness as a function
of E(B-V) was made. The observed E(B-V) distribution
was divided by the average completeness for a population
of quasars with the power-law luminosity distribution
described in Section 8.4, and the observed redshift
distribution, after applying the appropriate levels of reddening
and extinction. The E(B-V) distribution was truncated at
E(B-V) = 0.2 as for larger values of E(B-V) the
completeness is too low to determine the intrinsic distribution
with sufficient accuracy.
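
The completeness correction of the E(B-V) distribution can be illustrated with a short sketch. It assumes both the observed distribution and the average completeness are tabulated on a common E(B-V) grid; the function name and the toy numbers are hypothetical.

```python
def intrinsic_distribution(observed, completeness, ebv_max=0.2):
    """Divide an observed E(B-V) distribution by the completeness.

    observed, completeness: dicts keyed by E(B-V) grid value (assumed
    layout). Grid points beyond ebv_max, or with zero completeness, are
    dropped, and the result is renormalised to unit sum.
    """
    raw = {ebv: observed[ebv] / completeness[ebv]
           for ebv in observed
           if ebv <= ebv_max and completeness[ebv] > 0}
    total = sum(raw.values())
    return {ebv: value / total for ebv, value in raw.items()}


# Toy example: reddened quasars are observed less often than they occur.
toy_observed = {0.0: 0.9, 0.1: 0.1, 0.3: 0.0}
toy_completeness = {0.0: 0.9, 0.1: 0.5, 0.3: 0.01}
toy_intrinsic = intrinsic_distribution(toy_observed, toy_completeness)
```

In the toy case 10 per cent of the observed quasars are reddened, but after correction they make up about 17 per cent of the intrinsic population.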

In a flux limited sample, the mean observed E(B-V)
for a population of objects with an E(B-V) distribution
independent of redshift decreases with redshift, as the
increasing level of extinction with redshift experienced by the
population removes a larger and larger fraction of high-E(B-V)
objects. By contrast, the mean observed E(B-V) for the
BALQSOs, derived from the SDSS i-band to near-infrared
colours, increases strongly with redshift. Constructing
composite non-BAL and BALQSO spectra from the much larger
sample of BALQSOs satisfying the SDSS LOWZ and HIZ
quasar selection criteria, as shown in Fig. 22, demonstrates
the same effect: the E(B-V) shown by the composite
BALQSO spectra relative to the non-BAL spectra increases
significantly with redshift.

An observed variation in median E(B-V) could be
caused by a genuine trend with redshift, or by a
redshift-dependent selection effect. Indeed, the large
E(B-V)=0.075 value found for the z=2.5-3.0 interval is explained
by an increased probability of selecting quasars with non-zero
E(B-V) as the majority of unreddened quasars lie close
to the stellar locus in the ugriz colour-space. However, the
strong increasing trend in median E(B-V) with redshift is
present for redshift ranges where the selection probabilities
for both unreddened and reddened quasars are extremely
high. If dust content were strongly correlated with BAL
absorption strength this could produce an observed correlation
with redshift, as we are more sensitive to weak troughs at
low redshift, but an empirical determination of the relationship
between E(B-V) and BI for more than 300 BALQSOs
in the redshift interval 1.6≤z<2.6 shows no evidence for any
such correlation.

Having ruled out any significant bias resulting from the
quasar selection or from a dependence of E(B-V) on BI,
the most likely explanation for the correlation between the
observed median E(B-V) and redshift is an increase in the
intrinsic fraction of BALQSOs with larger values of E(B-V)
at higher redshifts. Unfortunately, while the trend in the
observed median E(B-V) for the BALQSOs is clear we do
not have sufficient information to determine the form of the
distribution of E(B-V) as a function of redshift.

We therefore adopt two related approaches to
parametrizing the BALQSO E(B-V) distribution. In the
first, a Gaussian distribution centred on E(B-V)=0.08
with width σ=0.03, truncated at E(B-V)=0.0 and 0.2,
was used, to provide a non-evolving reference. The form of
the distribution was chosen to produce an observed mean
E(B-V)=0.05 (over all redshifts 1.6≤z<4.5), the approximate
midpoint of the composite-derived values. For the second
approach, the Gaussian width was retained at 0.03,
independent of redshift, but a different central value was
adopted for each redshift bin such that the predicted mean
observed E(B-V) matched the linear trend with redshift
derived from the composite quasar spectra. The central values
of the Gaussian increase monotonically from 0.023 at
1.65<z<1.70 up to 0.099 at 4.0<z<4.5. Hereafter, we refer
to the two approaches as 'fixed' and 'redshift-dependent',
respectively.
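
The truncated Gaussian used for both approaches can be written down explicitly. The sketch below gives its probability density and its intrinsic mean, i.e. before any selection effects are applied; the function names are hypothetical and only the centre, width and truncation limits come from the text.

```python
from statistics import NormalDist


def truncated_pdf(ebv, centre, sigma=0.03, lo=0.0, hi=0.2):
    """Density of a Gaussian truncated to lo <= E(B-V) <= hi."""
    if not lo <= ebv <= hi:
        return 0.0
    gauss = NormalDist(centre, sigma)
    return gauss.pdf(ebv) / (gauss.cdf(hi) - gauss.cdf(lo))


def truncated_mean(centre, sigma=0.03, lo=0.0, hi=0.2):
    """Mean of the truncated Gaussian (standard closed-form result)."""
    gauss = NormalDist(centre, sigma)
    unit = NormalDist()
    a = (lo - centre) / sigma
    b = (hi - centre) / sigma
    norm = gauss.cdf(hi) - gauss.cdf(lo)
    return centre + sigma * (unit.pdf(a) - unit.pdf(b)) / norm


fixed_mean = truncated_mean(0.08)     # 'fixed' distribution
low_z_mean = truncated_mean(0.023)    # lowest-redshift central value
```

For the fixed distribution the truncation has little effect on the intrinsic mean, which stays close to 0.08; the lower observed mean quoted above arises once the E(B-V)-dependent completeness is folded in, which this sketch does not attempt.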

A large number of alternative E(B-V) distributions
were also tested to determine the relationship between input
E(B-V) distribution and the final determination of the
intrinsic BALQSO fraction. As with changing the luminosity
distribution, modifying the form of the E(B-V) distributions
changes the overall BALQSO fraction but has little
effect on the shape of that fraction as a function of redshift
as all redshifts are affected in a similar manner; this point
is discussed further in Section 10.

Figure 22. Composite spectra of BALQSOs and non-BAL
quasars. Each panel shows composites derived in a different
redshift range, as labelled. In each panel the black line shows the
BALQSO composite, the blue line shows the non-BAL quasar
composite, and the red line shows the same non-BAL composite
after reddening by the E(B-V) level shown to match the shape
of the BALQSO composite. The E(B-V) measurements have
uncertainties of ±0.05 mag. All spectra are normalised by the flux at
the red end of the spectrum. The numbers of BALQSO/non-BAL
spectra used to generate each composite are, from low to high
redshift, 1333/1204, 900/1249, 497/1073, 538/1013 and 172/1226.
Non-BAL spectra were chosen at random from those available.

The E(B-V) cumulative distribution functions (CDFs)
are shown in Fig. 23. The CDFs for non-BAL quasars do
not pass through the origin because the probability
distributions include a delta function at E(B-V) = 0 which,
after correcting for completeness, accounts for ~81 per cent
of non-BAL quasars.

Figure 23. E(B-V) cumulative distribution functions. Both the
observed (dotted black line) and completeness-corrected (solid
black line) CDFs for non-BAL quasars are shown. For BALQSOs,
the solid green line shows the fixed E(B-V) distribution,
while dashed red lines show the redshift-dependent distributions.
The lowest and highest redshift bins are labelled; the CDFs vary
monotonically with redshift in between.

Figure 24. Redshift and luminosity values of SDSS quasars
with complete coverage of the C IV region. Grey points represent
quasars that were not targeted under either the LOWZ or HIZ
criteria; blue points lie in regions with completeness less than 10
per cent; only the quasars represented by red points were used to
calculate the intrinsic BALQSO fraction. Dashed lines show the
boundaries of the redshift-luminosity bins used.



8.6 Calculating overall completeness values 

Overall completeness values were calculated in each
redshift-luminosity bin for non-BAL quasars and each of
the 24 BALQSOs individually. In short, quasars following
the power-law luminosity function were reddened according
to the different E(B-V) distributions described in
Section 8.5. The completeness contours from Section 8.3 were
then used to determine the fraction of such quasars that
would be targeted by the SDSS, and the overall completeness
was measured as the total number of targeted quasars
with a given observed luminosity divided by the total
input number of unreddened quasars with that luminosity.
This approach enables the completeness calculation to
explicitly include reddened quasars with
redshift-luminosity-E(B-V) combinations such that their
completeness is zero, which are not observed in the SDSS
catalogue. Full details of the procedure are given below.

First the quasar sample is restricted to those objects 
where reliable detection probabilities are available. The tar- 
get selection testing described above covers only the LOWZ 
and HIZ quasar target flags and only those quasars that were 
targeted under one or both of the LOWZ and HIZ criteria 
were retained. Early spectroscopic plates in the SDSS used 
slightly different targeting criteria so in using all plates there 
is a small bias against quasars that are not covered by the 
original targeting criteria but would have been selected by 
the final algorithm. This bias can be removed by excluding 
plates that were observed before the target selection algo- 
rithm was finalised. Doing so makes no significant difference 
to the results described below, suggesting the systematic er- 
rors induced by the bias are very small, but the statistical 
errors increase due to the reduced sample size. As a result, 
all spectroscopic plates are included in the sample. 

To reduce the impact of quasars with unusual spectral
properties, which were not covered by the target selection
testing, quasars with a redshift and apparent magnitude
such that the completeness for unreddened non-BAL
quasars or BALQSOs (averaged over different trough
properties) was less than 10 per cent were excluded from the
calculations. These quasars are by definition rare but their
inclusion would bias the calculation of the intrinsic BALQSO
fraction as these rare SEDs were not included in the calculation
of the completeness values. Quasars with either the C IV
Inc or C IV BBP flags set were also rejected. The
redshift-luminosity distribution of the 20 078 retained quasars is
shown in Fig. 24.

After restricting the input quasar sample the boundaries
of the luminosity ranges used to calculate $f_{\rm SDSS}$ were
shifted to reflect the new luminosity distribution; the new
boundaries are also shown in Fig. 24.

The input E(B-V) distributions were used to derive
the redshift-luminosity distributions for quasars that are
affected by dust. Intrinsic 1700-Å luminosities were converted
into observed i-magnitudes for all E(B-V) values using
extinction values and K-corrections based on the appropriate
synthetic non-BAL quasar and BALQSO spectra. From
these magnitudes and the intrinsic luminosity and E(B-V)
distributions the number of quasars that would be targeted
by the SDSS was found from the completeness contours
(derived above) interpolated to the required redshift,
observed magnitude and E(B-V) values, and averaged over
all redshifts within each bin. The completeness in each of
the redshift-luminosity bins shown in Fig. 24 was then
calculated by dividing the number of targeted quasars with
an observed luminosity in that range by the total number
of quasars with an intrinsic luminosity in that range. The
completeness was evaluated separately for non-BAL quasars
and each of the 24 BI/d ranges for BALQSOs.
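
The structure of this calculation, for a single redshift and a single luminosity bin, can be sketched as below. All of the callables are hypothetical stand-ins: `lum_weight` for the intrinsic luminosity distribution, `dimming_dex` for the reduction in observed luminosity caused by a given E(B-V), and `completeness` for the interpolated completeness contours.

```python
def overall_completeness(bin_lo, bin_hi, log_l_grid, lum_weight,
                         ebv_weights, dimming_dex, completeness):
    """Targeted quasars observed in a bin over intrinsic quasars in it.

    log_l_grid: intrinsic log-luminosities to sum over.
    ebv_weights: dict mapping E(B-V) to its probability (sums to one).
    """
    targeted = 0.0
    intrinsic = 0.0
    for log_l in log_l_grid:
        n = lum_weight(log_l)
        if bin_lo <= log_l < bin_hi:
            intrinsic += n              # intrinsic luminosity in the bin
        for ebv, w in ebv_weights.items():
            log_l_obs = log_l - dimming_dex(ebv)
            if bin_lo <= log_l_obs < bin_hi:
                # Reddened quasars enter the bin in which they are
                # observed, weighted by their selection probability.
                targeted += n * w * completeness(log_l_obs, ebv)
    return targeted / intrinsic


# Toy inputs (illustrative only).
toy_grid = [45.9 + 0.01 * k for k in range(60)]
toy_lum = lambda x: 10 ** (-2.2 * (x - 45.9))

no_dust = overall_completeness(
    46.1, 46.3, toy_grid, toy_lum, {0.0: 1.0},
    lambda ebv: 0.0, lambda log_l, ebv: 0.9)

half_lost = overall_completeness(
    46.1, 46.3, toy_grid, toy_lum, {0.0: 0.5, 0.1: 0.5},
    lambda ebv: 0.0, lambda log_l, ebv: 0.9 if ebv == 0.0 else 0.0)
```

Because the denominator counts all quasars with intrinsic luminosities in the bin, reddened quasars that are never targeted still lower the completeness, as in the second toy case.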



9 INTRINSIC BALQSO FRACTION 

As detailed in Sections 7 and 8, to derive the intrinsic
BALQSO fraction, $f_{\rm int}$, from the observed fraction,
$f_{\rm obs}$, we must first correct for the incomplete
identification of BAL troughs within the SDSS catalogue, then
correct for the different levels of completeness for BALQSOs and
non-BAL quasars entering the SDSS quasar sample. Over most
regions of parameter space these corrections are robust, but
at particular redshifts the probabilities that quasars are
selected become small, systematic errors become large and the
reliability of our analysis decreases. The regions where this
occurs are discussed below.

Figure 25. BALQSO fraction within the SDSS as a function of
redshift. Upper panel: all luminosities. Lower panel:
extinction-corrected $\lambda L_{1700} \ge 10^{46.3}$ erg s$^{-1}$.
Restricting the luminosities used minimizes the influence of the
observed redshift-luminosity correlation. In both panels the
square symbols give the total fraction over all BAL troughs,
while the crosses use $d_{\rm cor} \ge 0.35$ only.



9.1 Correction for incomplete identification 

An estimate of the true number of BALQSOs in the SDSS 
can be found by splitting the observed BALQSOs into the 
redshift, luminosity and mean depth bins used in Section 7 
and dividing the number in each bin by the relevant detec- 
tion probability. 
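
In code form the correction is a weighted sum, as in the sketch below. The bin keys, function names and toy numbers are hypothetical; the optional `p_min` argument simply skips bins with very small detection probabilities, where the estimate becomes unstable.

```python
def estimate_true_count(observed_counts, p_det, p_min=0.0):
    """Sum of observed BALQSO counts divided by detection probability.

    observed_counts, p_det: dicts keyed by a (redshift, luminosity,
    depth) bin label (assumed layout).
    """
    total = 0.0
    for key, n_obs in observed_counts.items():
        p = p_det[key]
        if p <= 0 or p < p_min:
            continue                 # bin too uncertain to correct
        total += n_obs / p
    return total


def sdss_fraction(n_bal_true, n_non_bal):
    """BALQSO fraction once missed BALQSOs are restored to the sample."""
    return n_bal_true / (n_bal_true + n_non_bal)


toy_counts = {"deep": 10, "shallow": 3}
toy_p_det = {"deep": 0.5, "shallow": 0.1}
toy_true = estimate_true_count(toy_counts, toy_p_det)
```

In the toy example thirteen observed BALQSOs correspond to an estimated fifty, with the three shallow troughs contributing more than the ten deep ones.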

Using the redshift-luminosity bins shown in Fig. 14 and
taking the total number of BALQSOs across all values of d,
the resulting estimates of $f_{\rm SDSS}$ are shown as a function
of redshift as the square symbols in the upper panel of Fig. 25.
The lower panel of Fig. 25 uses only quasars from the highest
extinction-corrected luminosity bin to ensure the use of the
same luminosity interval over all redshifts. The errors shown
are 1-σ errors calculated from bootstrap realisations of the
observed SDSS quasars and the quasars used to calculate
$p_{\rm det}$.
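
A simplified bootstrap of this kind is sketched below. It resamples only the observed quasars, with each BALQSO carrying a fixed weight of one over its detection probability; resampling the synthetic quasars used to measure the detection probability, as described above, is not attempted. The function name and the toy sample are hypothetical.

```python
import random
from statistics import stdev


def bootstrap_fraction_error(is_bal, weights, n_boot=1000, seed=0):
    """1-sigma bootstrap error on a weighted BALQSO fraction.

    is_bal: list of booleans, one per quasar.
    weights: list of 1 / p_det values (used only for BALQSOs).
    """
    rng = random.Random(seed)
    n = len(is_bal)
    fractions = []
    for _ in range(n_boot):
        # Resample the quasar list with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        bal = sum(weights[i] for i in idx if is_bal[i])
        non_bal = sum(1 for i in idx if not is_bal[i])
        fractions.append(bal / (bal + non_bal))
    return stdev(fractions)


# Toy sample: 100 BALQSOs among 1000 quasars, all with unit weight.
toy_is_bal = [True] * 100 + [False] * 900
toy_weights = [1.0] * 1000
toy_error = bootstrap_fraction_error(toy_is_bal, toy_weights, n_boot=300)
```

For the toy sample the result is close to the binomial expectation of about 0.0095; large weights on a few BALQSOs would increase the scatter considerably.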

Summing quasars over all redshifts gives an overall
fraction f_SDSS = 14.0±1.6 per cent. In this summation the two
redshift-luminosity bins in which the averaged p_det is less
than 0.05 were not used, as a small error in the value of p_det
in these bins creates a much larger error in the resulting
calculation of f_SDSS.
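The correction described above can be sketched in a few lines of Python. This is an illustration only: the `ZLBin` record and its field names are hypothetical, and the sketch assumes that a redshift-luminosity bin rejected for low averaged p_det is dropped from both the BALQSO count and the total quasar count.

```python
from dataclasses import dataclass


@dataclass
class ZLBin:
    """One redshift-luminosity bin (hypothetical layout)."""
    n_quasars: int      # all quasars, BAL and non-BAL, in the bin
    p_det_mean: float   # detection probability averaged over the bin
    cells: list         # (n_bal_obs, p_det) for each mean-depth bin


def sdss_bal_fraction(bins, p_min=0.05):
    """Estimate f_SDSS by dividing observed BALQSO counts by p_det.

    Bins whose averaged p_det is below p_min are skipped, because a small
    error in p_det there gives a much larger error in n_bal_obs / p_det.
    """
    n_bal_true = 0.0
    n_total = 0
    for b in bins:
        if b.p_det_mean < p_min:
            continue
        # Correct each mean-depth cell by its own detection probability.
        n_bal_true += sum(n_obs / p_det for n_obs, p_det in b.cells if n_obs > 0)
        n_total += b.n_quasars
    if n_total == 0:
        raise ValueError("no bins pass the p_det threshold")
    return n_bal_true / n_total
```

For example, a bin of 1000 quasars with 40 BALQSOs at p_det = 0.8 and 10 at p_det = 0.25 gives an estimated 90 BALQSOs, i.e. a fraction of 9 per cent, even though only 5 per cent were observed.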

The statistical uncertainties in the determination of
f_SDSS are very large in places; this is principally the result
of large uncertainties in the true number of BAL troughs
with low d_cor. For a number of datapoints the value plotted
in Fig. 25 is dominated by just one or two BALQSOs that
lie in regions with very low p_det. The strong dependence of
the uncertainties on the mean depth can be seen in Fig. 26,
which shows f_SDSS as a function of d_cor, averaged over all
redshifts and luminosities. The results are binned according
to the mean depths at which p_det was measured. It is clear
that the small number of troughs observed with low d in the
catalogue presented in Section 5 is due to the small
detection probabilities, rather than an intrinsic lack of shallow
absorbers.

Figure 26. Observed (circles) and estimated SDSS (squares)
BALQSO fractions as a function of corrected mean depth, binned
in the d_cor ranges shown by the horizontal error bars. Symbols
are plotted at the d_cor values for which p_det was tested. Note
that many of the vertical error bars are smaller than the plotted
symbols.

The crosses in Fig. 25 show the results when only BAL 
troughs with d_cor ≥ 0.35 are included; the cut-off depth was
chosen to ensure a high detection probability over all red- 
shifts and luminosities, and hence a low uncertainty in the 
true number of BALQSOs. The resulting SDSS BALQSO 
fraction includes only a subset of all BALQSOs, but the 
increased precision allows a better determination of the red- 
shift dependence. 



9.2 Correction for SDSS target selection 

The intrinsic numbers of non-BAL quasars and BALQSOs 
were calculated by dividing the numbers of each estimated to 
be present in the SDSS by the completeness values derived 
in Section 8, using the redshift-luminosity bins shown in 
Fig. 24. The correction was applied separately for the fixed 
and redshift-dependent E(B-V) distributions.
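The target-selection correction amounts to dividing each SDSS count by its completeness before forming the fraction. The sketch below illustrates that arithmetic only; the function name and argument layout are hypothetical, and the 0.05 cut is taken from the threshold quoted later in this section.

```python
def intrinsic_bal_fraction(n_bal_sdss, n_nonbal_sdss, c_bal, c_nonbal, c_min=0.05):
    """Combine per-bin SDSS counts and completenesses into f_int.

    Each argument is a sequence with one entry per redshift-luminosity bin.
    n_bal_sdss should already be corrected for incomplete identification.
    Bins where either completeness is below c_min are left out.
    """
    n_bal_int = 0.0
    n_nonbal_int = 0.0
    for nb, nn, cb, cn in zip(n_bal_sdss, n_nonbal_sdss, c_bal, c_nonbal):
        if cb < c_min or cn < c_min:
            continue
        n_bal_int += nb / cb        # intrinsic number of BALQSOs
        n_nonbal_int += nn / cn     # intrinsic number of non-BAL quasars
    total = n_bal_int + n_nonbal_int
    if total == 0:
        raise ValueError("no usable bins")
    return n_bal_int / total
```

With 14 BALQSOs at a completeness of 0.25 and 86 non-BAL quasars at a completeness of 0.8, a 14 per cent SDSS fraction becomes an intrinsic fraction of about 34 per cent, which shows how a lower BALQSO completeness raises f_int.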

The intrinsic BALQSO fraction, f_int, is shown as a function
of redshift in Fig. 27, in which the results from some
adjacent redshift bins have been combined to reduce the
statistical uncertainties. The top panel uses quasars with
all luminosities, while the bottom panel uses only quasars
with extinction-corrected λL_1700 ≥ 10^46.4 erg s^-1 in order
to minimize the variation with redshift of the luminosity
distribution. In both panels black squares represent results
using the fixed E(B-V) distribution, and red crosses use
the redshift-dependent distribution. Only BAL troughs with
d_cor ≥ 0.35 were used, because of the large uncertainties at
lower values of d_cor discussed in Section 9.1. As before, the
errors are calculated from bootstrap realisations of the
observed SDSS quasars, the quasars used to calculate p_det, and
also the sets of synthetic magnitudes from which the target
selection completeness contours were derived.

As well as the statistical errors shown, the results will
be affected by systematic errors. Details of tests carried out
to determine the level of systematic errors are given below,
but throughout most of the redshift range covered these
errors are relatively small. However, in the range 2.3 ≤ z < 3.0
the completeness for one or both of non-BAL quasars and
BALQSOs drops to very low levels, as previously noted by
Richards et al. (2002) and Reichard et al. (2003b). When the
completeness is low the systematic errors become far more
significant, for two reasons: firstly, a small absolute error in
the completeness measurement corresponds to a large
relative error; and secondly, a greater fraction of the observed
objects will have unusual SEDs that were not covered by
the completeness testing described in Section 8. As such, we
caution that the systematic errors for 2.3 ≤ z < 3.0 are likely
to be considerably larger than the statistical errors
plotted in Fig. 27. The affected region is shaded in Fig. 27. In
other redshift regions the completeness is high and hence
the uncertainties are smaller, so it is on these regions that
we concentrate in the following discussion (Section 10).

Figure 27. Intrinsic BALQSO fraction as a function of redshift,
including troughs with d_cor ≥ 0.35 only. Top panel: all
luminosities. Bottom panel: extinction-corrected λL_1700 ≥ 10^46.4 erg s^-1
only. Restricting the luminosities used minimizes the influence
of the observed redshift-luminosity correlation. In both panels
black squares and red crosses use fixed and redshift-dependent
E(B-V) distributions, respectively. The shaded region marks
the redshift range in which the systematic errors are predicted to
be considerably larger than the statistical errors shown.

Figure 28. Intrinsic BALQSO fraction as a function of
extinction-corrected luminosity, including troughs with d_cor ≥
0.35 only. Top panel: all redshifts. Bottom panel: the results
restricted to quasars with z < 2.0. In both panels black squares and
red crosses use fixed and redshift-dependent E(B-V)
distributions, respectively. Restricting the redshift range used minimizes
the influence of the redshift-luminosity correlation, which is
responsible for the strong trend evident in the top panel.
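The first of the two reasons given above is a matter of first-order error propagation, which the following sketch makes explicit. The function name and the example numbers are illustrative assumptions, not values from the completeness tests.

```python
def corrected_count_with_error(n_obs, completeness, completeness_err):
    """Return N/C and the part of its error that comes from the error on C.

    First-order propagation: sigma(N/C) = (N/C) * sigma_C / C, so the
    relative error on the corrected count equals sigma_C / C.
    """
    n_corr = n_obs / completeness
    return n_corr, n_corr * completeness_err / completeness
```

The same absolute error of 0.02 on the completeness gives a relative error of 2.5 per cent when C = 0.80, but 40 per cent when C = 0.05.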

Fig. 28 shows the results for f_int as a function of
extinction-corrected luminosity. In this case the top panel 
uses data from all redshifts while the bottom panel restricts 
the redshift range to z<2.0. The symbols and colours are 
the same as in Fig. 27. 

Taking the average over all luminosities and d values,
and all redshifts z<2.3 or z>3.0, the intrinsic BALQSO
fraction is f_int = 38.8±2.2 or 40.7±5.4 per cent for the fixed and
redshift-dependent E(B-V) distributions, respectively. In
calculating this fraction, redshift-luminosity bins were used
only if the overall completeness for both non-BAL quasars
and BALQSOs, and the averaged p_det, were all greater than
0.05. If the analysis is restricted to quasars that exhibit only
strong BAL troughs, with d_cor ≥ 0.35, the intrinsic fraction
falls to 30.2±3.4 or 25.6±2.7 per cent for the two E(B-V)
schemes.

An important source of systematic error in the
calculation of f_int is the determination of p_det. In particular, the
values of p_det used were averages over a range of BI values
at any particular d_in, so if the sample of BAL trough profiles
used was unrepresentative the determination of p_det would
be biased. Using each of the profiles in Fig. 16 in isolation,
rather than averaging over all five, gives a range of values of
f_SDSS and f_int with standard deviations of 4 and 8 per cent,
respectively, which gives an indication of the variation that
could be induced by making an extreme change to the
distribution of BAL trough profiles. Restricting the analysis to
deep troughs (d_cor ≥ 0.35) decreases the possible variation
in the results by a factor of 4, as the typical corrections for
incompleteness become smaller.

The input luminosity distribution used has an effect on
the values of f_int measured, principally because a steeper
luminosity distribution results in a larger fraction of reddened
quasars dropping below any luminosity limits for any given
E(B-V). For the fixed E(B-V) distribution, changing α
in the luminosity distribution by 0.1 changes f_int by
approximately 1 per cent at all redshifts, with a steeper (shallower)
luminosity distribution giving a larger (smaller) f_int. For the
redshift-dependent E(B-V) distribution the effect is
greatest at high redshift where the typical E(B-V) values for
BALQSOs are highest. However, at all redshifts the change
in f_int is less than 1.7 per cent for a change in α of 0.1, so an
unfeasibly shallow luminosity distribution would be required
to remove the strong redshift evolution seen in Fig. 27.

The results for the BALQSO fraction are not strongly
sensitive to the positions of the luminosity boundaries shown
in Figures 14 and 24: examining the range in results
produced using different positions suggested the systematic
error is no more than ±1.2 per cent in f_SDSS and ±4.2 per
cent in f_int.

Varying the relative fractions of the 'regular' and 'blue'
model quasars makes little difference to the calculated in- 
trinsic BALQSO fraction for z < 2.3 and z > 3.0, where the 
completeness for both models is reasonably high. For the in- 
termediate redshift interval, where the SDSS quasar sample 
suffers from very significant incompleteness, 'blue' BALQ- 
SOs are significantly more likely to be included than 'regu- 
lar' BALQSOs, so increasing the fraction of 'blue' quasars 
decreases the completeness correction for the BALQSOs and 
hence decreases f_int.



10 DISCUSSION 

In the preceding Sections we have presented a detailed anal- 
ysis of the principal selection effects that affect the observed 
fraction of high-ionisation BALQSOs (HiBALQSOs). Even 
in the SDSS spectroscopic samples, the number of LoBALQ- 
SOs at redshifts z<1.5 is relatively small and we have not 
quantified the selection effects for such objects. However, the 
number of LoBALQSOs contained in our, and other recent, 
catalogues suitable for a careful analysis of multiple transi- 
tions from the same species (e.g. Moe et al. 2009) is growing 
and a better understanding of the frequency of occurrence 
and importance of low-ionisation species in the BALQSO 
population should follow. 

With the quantitative information in hand for the Hi- 
BALQSO population we are able to measure the intrinsic 
BALQSO fraction and, more importantly, its variation with 
redshift and luminosity. Quantifying such variation allows 
us to provide important constraints on BALQSO models in 
a way not previously possible. 



10.1 Overall BALQSO fraction 

The observed C IV BALQSO fraction f_obs = 8.0±0.1 per
cent derived here is lower than that from previous work, but
the resulting intrinsic fraction f_int = 38.8±2.2 or 40.7±5.4
per cent is considerably larger than most other estimates.
The low observed fraction is largely due to the decision not 
to smooth spectra before calculating the BI, resulting in a 
more conservative definition of BALQSOs. This difference 
is accounted for in the correction for BAL troughs that are 
present in the SDSS spectra but undetected when continuum 
reconstructions are made. It is worth noting that the
effective 'smoothing' of spectra prior to searches for BALQSOs
has to date been largely ad-hoc, resulting from either
instrumental limitations (i.e. resolution R = λ/δλ) or small-scale
(≲200 km s^-1) filters (e.g. G09). An alternative approach to
that adopted in this paper would be to optimise the
detection of BALQSOs by applying a filtering scheme with a
scale ~2000 km s^-1, i.e. equivalent to the minimum extent
of BAL troughs, thereby maximising the completeness of a 
BALQSO catalogue. Part of the rationale for making avail- 
able the NMF-generated continua is indeed to allow such an 
approach to be undertaken and relevant comparisons made 
between different 'detection' schemes for BAL troughs. 
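The effect of smoothing on the BI can be illustrated with a short sketch. It assumes the standard Weymann et al. (1991) definition of the balnicity index (absorption deeper than 10 per cent of the continuum, counted only after 2000 km s^-1 of continuous absorption, between 3000 and 25000 km s^-1 blueward of the line centre); the function names, the uniform velocity grid, the positive-blueshift sign convention and the pixel-based boxcar are assumptions of this example and not the procedure used to build the catalogue.

```python
def boxcar_smooth(flux, width):
    """Running mean over `width` pixels (odd), with a shorter window at the edges."""
    half = width // 2
    out = []
    for i in range(len(flux)):
        window = flux[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


def balnicity_index(velocity, norm_flux, v_lo=3000.0, v_hi=25000.0, min_width=2000.0):
    """Balnicity index in km/s.

    velocity  : blueshift from line centre in km/s (positive), increasing
                and uniformly spaced
    norm_flux : flux divided by the continuum, e.g. an NMF reconstruction
    """
    dv = velocity[1] - velocity[0]
    bi = 0.0
    run = 0.0   # width of the current continuous stretch below 0.9 of the continuum
    for v, f in zip(velocity, norm_flux):
        if not (v_lo <= v <= v_hi):
            continue
        depth = 1.0 - f / 0.9
        if depth > 0.0:
            run += dv
            # Absorption only counts once the trough exceeds min_width.
            if run > min_width:
                bi += depth * dv
        else:
            run = 0.0
    return bi
```

A single noisy pixel that rises above 0.9 of the continuum resets the running width, so an unsmoothed spectrum can yield a smaller BI, or none at all, for a trough that a smoothed spectrum would recover. Calling `balnicity_index(v, boxcar_smooth(flux, 3))` instead of `balnicity_index(v, flux)` shows the difference.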

Recent work (Dai, Shankar & Sivakoff 2008; Maddox
et al. 2008) has advocated the existence of high BALQSO
fractions based on samples selected at near-infrared
wavelengths. However, the Dai et al. (2008) type of analysis
is still dependent on the observed distribution of quasars
in the SDSS (optical) catalogues. In the investigation
described here the optical-to-near-infrared properties of the
SDSS-selected quasars are first used to determine the
observed distributions of E(B-V) for non-BAL quasars, up
to moderate reddenings of E(B-V)=0.2 magnitudes. For
BALQSOs, parametrizations of the intrinsic E(B-V)
distribution are chosen to reproduce the observed distribution,
accounting for the strong selection effects present. Then, the
E(B-V) distributions are included in a self-consistent way
to allow the determination of the intrinsic fraction of
BALQSOs (up to the E(B-V)=0.2 mag limit). The intrinsic
fraction of BALQSOs derived here is very similar to the
observed ~40 per cent value from Dai et al. (2008), based on a
sample of SDSS quasars detected in 2MASS. The observed
fraction of BALQSOs will increase for flux-limited samples
defined at increasingly longer wavelengths. Determination of
the intrinsic fraction of BALQSOs from a sample involving
flux limits in two passbands, defined at substantially
different epochs, involves quantification of rather complex
selection effects and we believe the apparent agreement between
the quoted BALQSO fractions to be somewhat misleading.
Urrutia et al. (2009) also propose a very high fraction of
BALQSOs based on a sample derived using FIRST, 2MASS
and the SDSS. Their objects possess E(B-V) values
extending well beyond the E(B-V)=0.2 mag limit used here,
and the number of quasars at z>1 is small. The individual
quasars are impressive examples of significantly reddened
objects but, again, a careful analysis of the selection effects,
including the strong bias towards the identification of red,
broad-line objects with strong Hα emission present in the
K-band, for the redshift interval z~2.1-2.6, is required.

In an analysis that is similar in concept to that
described here, Reichard et al. (2003b), whose results were also
used in Knigge et al. (2008), modelled the effect of larger
E(B-V) values among BALQSOs by taking the difference
in colours between the SDSS EDR composite quasar
spectrum and a composite of HiBALQSOs. This colour
difference was applied to sets of synthetic quasar colours to
allow each object to be processed by the SDSS quasar tar- 
get selection algorithm as both a BALQSO and non-BAL 
quasar, leading to a correction for colour-dependent selec- 
tion effects. A separate correction was applied to account 
for different levels of dust extinction. The Reichard et al. 
(2003b) method produces an average completeness correc- 
tion based on quasars that were spectroscopically observed, 
but as such it is biased against those quasars for which the 
SDSS completeness is low, and it does not fully explore the 
parameter space of (reddened) quasar properties. Employ- 
ing simulations of quasar SEDs for an extended range of 
E(B-V) values, as described in Section 8, and processing
the reddening and extinction corrections together, provides 
a more accurate determination of the relative completeness 
levels. 

The two different parametrizations of the E(B-V)
distributions, described in Section 8.4, produce consistent
results for the overall BALQSO fraction. Such agreement is
unsurprising given the parametrizations were each chosen to
reproduce the typical E(B-V) values observed in the
identified BALQSOs. We note that the overall BALQSO fraction
is highly sensitive to the input E(B-V) distribution: in
general, f_int is correlated with the mean input E(B-V) as
a large fraction of high-E(B-V) objects would imply a low
completeness. However, any input E(B-V) distribution
used must be consistent with the typical values observed,
placing a strong constraint on the possible range of
distributions.



10.2 Redshift dependence 

Notwithstanding the quantification of the detection
probabilities for BAL troughs as a function of d (Fig. 19), the
combination of low p_det and small-number statistics means that
the observed fraction of troughs with d_cor < 0.35 is poorly
constrained as a function of any parameter of potential
interest, e.g. redshift. Indeed, for z-λL_1700-d combinations for
which p_det drops low enough there are no observed
BALQSOs and hence we can at best provide an upper limit to
their numbers. The problem can be seen in Fig. 25, in which
the f_SDSS values for all BALQSOs, and only those with
d_cor ≥ 0.35, converge at z ≥ 3.0, i.e. no additional shallow
troughs are detected.

To investigate potential variation in the intrinsic
fraction of BALQSOs as a function of redshift and luminosity we
therefore confine ourselves to consideration of the BALQSO
sample with d_cor ≥ 0.35. When the BALQSO sample is
limited in this way, as shown in Fig. 27, f_int apparently peaks at
>80 per cent at z~2.6. However, the proximity of quasars
to the stellar locus in the SDSS ugriz-space at this
redshift results in a very low completeness for both BALQSOs
and non-BAL quasars and we do not believe the simulations
of the type we have undertaken are adequate for redshifts
2.3 ≤ z < 3.0. The larger median observed E(B-V) for the
BALQSOs in the interval, relative to adjacent redshifts
(Section 8.5), likely indicates that further testing is required to
fully explore the sensitive linkage between quasar SED
properties and the SDSS selection at these redshifts. Systematic
uncertainties are thus considerably larger than the
statistical errors plotted in Fig. 27 and in the following discussion
we focus on the results for z<2.3 and z ≥ 3.0, where the
completeness is considerably higher and hence the uncertainties
both smaller and well quantified.

Comparing the low (z<2.3) and high (z ≥ 3.0) redshift
regions, we see different patterns depending on the E(B-V)
distributions used. With a fixed E(B-V) distribution
there is little overall difference between the two regions, but
the redshift-dependent distribution results in a significantly
higher f_int at high redshift. This trend appears because a
higher mean E(B-V) is necessary at high redshift, to match
the observed trend, and so a lower completeness for high-z
BALQSOs is predicted.

An observed trend with redshift can in general be
produced by an intrinsic trend with luminosity, although the
multiple criteria under which SDSS quasars were selected
to some extent reduce the strong redshift-luminosity
correlation present in most flux-limited samples. The results
described above do not change greatly when we restrict
our sample to the most luminous quasars (bottom panel
of Fig. 27), but the trend of higher f_int at higher redshift
for the redshift-dependent E(B-V) distribution is
strengthened, and a similar but weaker trend can also be seen for the
fixed E(B-V) distribution. With the luminosity threshold
in place there is very little redshift evolution of the
luminosity distribution of the quasars, implying that the observed
trends represent a true redshift evolution of the BALQSO
fraction.

A different parametrization of the E(B-V) distribution
as a function of redshift would in general predict a different
trend in f_int. However we note that, even when using the
fixed E(B-V) distribution, which overpredicts the mean
E(B-V) at low redshift and underpredicts it at high
redshift, f_int at z ≥ 3.0 is greater than that at z<2.3 by a
factor 1.6±0.2. To remove this trend would require a decreased
completeness at low redshift, or an increased completeness at
high redshift, either of which would exacerbate the
disagreement between the predicted and observed mean E(B-V).
When using a redshift-dependent E(B-V) distribution that
better predicts the SEDs of composite BALQSO spectra the
redshift trend becomes considerably stronger, with a factor
3.5±0.4 difference between the high- and low-redshift
intrinsic fractions.

Other than the E(B — V) distributions, the largest 
source of systematic error we have identified is in the choice 
of synthetic BAL troughs used to quantify the detection 
probability, p_det. However, any variation in p_det due to a
different range of synthetic troughs would affect all redshifts
in approximately the same way, making little difference to 
the form of the observed trends. Other sources of system- 
atic error, such as the luminosity distribution used or the 
positioning of the redshift-luminosity bins, were found to 
have minimal effect on the results. Further sources of sys- 
tematic error, of which we are aware, would, to at least first 
order, also affect all redshifts in a similar manner. As such 
we expect the trends to be robust against known sources of 
systematic error and Fig. 27, showing the large, factor ~3.5, 
change in the fraction of BALQSOs as a function of redshift, 
is the main scientific result of our investigation. 

A further implication of the high BALQSO fraction
at high redshift is in the number density of high-redshift
quasars. The relatively low completeness for z ≥ 3.0
BALQSOs implies that the true quasar number density at such
redshifts is higher than observed by ~50 per cent. At face
value the increase in space density is relatively modest but if
the results of Glikman et al. (2010) concerning the steepness
of the faint end of the quasar luminosity function at z~4 are
confirmed then our result may have some implications in the
context of the ability of the quasar population to maintain
the ionisation of the inter-galactic medium at z≲5.

10.3 Luminosity dependence 

There is a dynamic range of approximately one decade in 
luminosity at fixed redshift in our sample (Fig. 24). Some- 
what unusually for an optically defined quasar sample, the 
two different selection algorithms employed in the SDSS (to 
target low- and high-redshift quasars), with their different 
faint magnitude limits, result in a relatively weak correla- 
tion between redshift and median luminosity. It is thus vi- 
able to determine whether luminosity contributes to the very 
strong redshift-dependent trends in BALQSO fraction dis- 
cussed above. 

The observed BALQSO fractions for Si IV, C IV and
Al III are strongly dependent on luminosity, with a much
higher fraction at high values of observed λL_1700. This is
in part because the spectra of low-luminosity quasars will in
general have lower S/N, and a BAL trough is less likely to be
identified in a low S/N spectrum. However, the S/N
dependence is to some extent balanced by the tendency for
BALQSOs to possess larger E(B-V) values (due to dust) than
non-BAL quasars, reducing the observed BALQSO
luminosities. After correcting for the competing selection effects
for C IV BALQSOs the intrinsic fraction still shows a
positive correlation with extinction-corrected luminosity, but
the apparent trend is the result of the redshift dependence
discussed in Section 10.2. When the analysis is restricted to
quasars with z<2.0 and BAL troughs with d_cor ≥ 0.35, f_int
shows no dependence on luminosity. However, the limited
dynamic range in luminosity means that the constraint on
any luminosity-dependent behaviour is weak. Fitting linear
regression lines to the data in the lower panel of Fig. 28
gives slopes of -2.1±6.4 per cent dex^-1 (fixed E(B-V)
distribution) and 0.1±2.3 per cent dex^-1 (redshift-dependent
E(B-V) distribution). For the latter distribution the
uncertainties correspond to 3-σ limits of -6.9 and 7.0 per cent
dex^-1.
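A slope and its 3-σ limits of the kind quoted above can be obtained from a weighted straight-line fit. The sketch below uses the standard weighted least-squares formulae; it is an assumption that the fit was performed in this way, and the function names are illustrative.

```python
import math


def weighted_line_fit(x, y, sigma):
    """Weighted least-squares fit of y = a + b*x; returns (b, sigma_b)."""
    w = [1.0 / s ** 2 for s in sigma]
    sw = sum(w)
    sx = sum(wi * xi for wi, xi in zip(w, x))
    sy = sum(wi * yi for wi, yi in zip(w, y))
    sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    delta = sw * sxx - sx * sx
    slope = (sw * sxy - sx * sy) / delta
    slope_err = math.sqrt(sw / delta)
    return slope, slope_err


def three_sigma_limits(slope, slope_err):
    """Lower and upper 3-sigma limits on the slope."""
    return slope - 3.0 * slope_err, slope + 3.0 * slope_err
```

Here x would be log luminosity in dex, y the intrinsic fraction in per cent and sigma its 1-σ error. Applying `three_sigma_limits` to the rounded values 0.1 and 2.3 gives approximately -6.8 and 7.0 per cent dex^-1; the small difference from the quoted lower limit presumably reflects rounding of the published slope and error.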

The observed fraction of Mg II BALQSOs is highest at
the lowest observed luminosities, in contrast to the other
ions, due largely to a population of high-E(B-V) objects.
The most extreme objects of this type are often undetected
at redshifts z>1.8 due to the catastrophic reduction in the
observed i-band flux once the absorbed part of the quasar
SED, shortward of Mg II, falls in the SDSS i band.
Dust-induced extinction is also likely to make the population of
Mg II BALQSOs even fainter as the quasar redshift increases.
Observational strategies of the type described by Urrutia 
et al. (2009) should prove far more effective in quantifying 
the fraction of such extreme BALQSOs. 

10.4 Comparison to models 

One of the two primary classes of model for BALQSOs 
(Weymann et al. 1991) involves the presence of broad ab- 
sorption line clouds in all quasars but, due to incomplete 
solid angle coverage, quasars are only observed as BALQ- 
SOs when viewed along certain sightlines. Hence, broad ab- 
sorption troughs are observed in some quasar spectra but 
not others, and the fraction in which they are observed can 
be directly related to the fractional solid angle coverage of 
the broad absorption regions. Ganguly & Brotherton (2008) 
summarise the body of evidence that the BALQSO fraction, 
and more generally the outflow fraction, is largely indepen- 
dent of the properties of the quasar. If incomplete solid angle 
coverage were the only determinant of the intrinsic fraction 
of BALQSOs it would be expected that the fraction should 
not vary as a function of redshift: the geometry of the BAL 
region is predicted to be constant with respect to time. 

The results of Section 9 provide evidence against this
simple model. In particular, the intrinsic BALQSO fraction
is found to be significantly greater at high redshift (z ≥ 3.0)
than at low redshift (1.65 ≤ z < 2.3). Such a change
requires a model in which f_int depends on one or more
parameters that are themselves varying functions of redshift. One
possibility is that the coverage of the broad absorption line
regions varies during the life of a quasar. The probability
of viewing a BAL region in any particular quasar may still
be a function of the solid angle coverage of the BAL clouds,
but that coverage would itself be a function of the age of the
quasar. Farrah et al. (2007) and Urrutia et al. (2009) have
argued recently that at least the most extreme examples of
the BAL phenomenon, the rare FeLoBAL quasars, are
indeed the manifestation of an evolutionary scheme in which
the BALQSOs represent the final phase in the emergence
of a naked quasar from an earlier dust- and gas-enshrouded
fuelling phase. The observed variation in E(B-V) with
redshift would also be consistent with such a scheme.

An acknowledged difficulty with the evolutionary class 
of models is the lack of evidence for differences in the mid- 
infrared fluxes of BALQSOs and non-BAL quasars (Gal- 
lagher et al. 2007), expected to arise from the presence 
of an obscuring 'cocoon' at early times in the evolution 
of the objects. However, the evolutionary scenario has at- 
tracted support from recent studies at other wavelengths 
(e.g. Montenegro-Montes et al. 2009), although observations 
are still at an exploratory stage in many cases (Priddey 
et al. 2007). Our results relate to the much more common 
HiBALQSOs but the large, factor of 3.5, decrease in the 
BAL fraction from the highest redshifts, z~4.5, to redshift, 
z~2.6, where the observed space density of the most lu- 
minous quasars peaks, coincides with the period when the 
individual black hole growth within quasars was at its most 
rapid. 

Alternative models that allow for cosmic evolution of 
the BALQSO fraction include identifying BAL regions with 
radiation-driven disc winds (e.g. Proga, Stone & Kallman 
2000; Risaliti & Elvis 2009). Such winds are a class of out- 
flow that can be generated by quasar accretion discs under 
a variety of physical conditions. As the solid angle coverage 
of the winds, and indeed the possibility of their existence, is 
a function of physical parameters such as the mass and Ed- 
dington ratio of the quasar (Proga & Kallman 2004; Risaliti 
& Elvis 2009), and the typical values of these parameters 
vary with redshift (e.g. Hopkins, Richards & Hernquist 2007; 
Steinhardt & Elvis 2010a,b), models of this type would gen- 
erate a redshift-dependent BAL fraction without requiring 
- but also without contradicting - evolutionary BALQSO 
models. 

Although we have provided a very brief summary of 
some of the considerations that relate to the main classes of 
models for BALQSOs the primary purpose of this paper is 
to present the first quantitative determination of the BAL 
fraction of luminous quasars as a function of redshift and lu- 
minosity. At face value, the very strong systematic changes 
in the BAL fraction as a function of redshift present a sig- 
nificant challenge for current models. 



11 CONCLUSIONS 

We have applied non-negative matrix factorisation to the 
reconstruction of SDSS quasar spectra, with particular ref- 
erence to BALQSOs, and presented the resulting measure- 
ments of BAL properties. The observed C IV BAL fraction 
was corrected for incomplete identification of BAL troughs 
by the NMF routine, and for differential SDSS spectroscopic 
target selection for BALQSOs and non-BAL quasars. 
The principal results from this analysis are that: 

(i) A total of 811 Si IV, 3296 C IV, 214 Al III and 215
Mg II BALQSOs are detected, corresponding to observed
BALQSO fractions of 3.4±0.1, 8.0±0.1, 0.38±0.03 and
0.29±0.02 per cent, respectively.

(ii) The probability of a BAL trough being detected by 
the NMF procedure is strongly dependent on the S/N of the 
spectrum and the mean depth of the trough. The detection 
probability is often very low for d_cor < 0.35, resulting in an
observed deficit of shallow BAL troughs. 

(iii) After correcting for incomplete identification of BAL 
troughs, the estimated C IV BALQSO fraction within the 
SDSS spectroscopic survey is 14.0 ± 1.6 per cent. 

(iv) After correcting for differential SDSS target selection 
of BALQSOs and non-BAL quasars the estimated intrinsic 
C IV BALQSO fraction is 40.7±5.4 per cent when using a
redshift-dependent E(B-V) distribution.

(v) The intrinsic BALQSO fraction decreases by a factor
of 3.5±0.4 between the redshift intervals 3.0 ≤ z < 4.5 and
1.65 ≤ z < 2.3, implying that the orientation of a sightline with
respect to the quasar and its torus alone is insufficient to
determine the presence or otherwise of a BAL trough.

(vi) The intrinsic BALQSO fraction shows no significant 
variation with the luminosity of the quasars in the sample, 
within the restricted luminosity range in which the compar- 
ison can be made. 

The NMF reconstructions in the region around the C IV 
emission and absorption lines, of each of the 48 146 quasar 
spectra in the sample with coverage of the C IV region, will 
be made available through the SDSS value added catalogues 
as this paper is published. 
Table 1. Properties of the NMF reconstructions and the resulting broad absorption properties.

SDSS object name    RA (J2000)  Dec. (J2000)  MJD    Plate  Fiber  z       i       SN_1700  log(F_1700)
                    deg         deg                                        mag              [erg cm^-2 s^-1 Å^-1]

000006.53+003055.2  0.027231    0.515332      52203  685    467    1.8240  20.041  4.288    -16.019
000008.13+001634.6  0.033946    0.276292      52203  685    470    1.8366  19.420  5.010    -15.952
000009.26+151754.5  0.038609    15.298489     52251  751    354    1.1971  19.058  0.000    0.000
000009.38+135618.4  0.039099    13.938458     52235  750    82     2.2400  18.172  13.950   -15.287
000009.42-102751.9  0.039264    -10.464410    52143  650    199    1.8520  18.700  9.899    -15.530



Table 1 - continued

log(λL_1700)  HIZ  LOWZ  Primary  GenComp  N_c  RedComp  χ²_ν   N_ν   Slope   SICor  DipMask
[erg s^-1]

45.574        -    -     1        -        12   -        0.869  3641  -0.277  -      -
45.648        -    -     1        -        12   -        0.951  3589  0.576   -      -
0.000         -    1     1        -        9    -        1.175  3502  0.746   -      -
46.525        -    1     1        -        11   -        1.358  3468  0.109   -      -
46.080        -    1     1        -        12   -        0.973  3534  -0.142  -      -

Table 1 - continued

ManMask  Si IV BI  Si IV d  Si IV v_min  Si IV v_max  Si IV v_cov,min  Si IV v_cov,max  Si IV N_BR  Si IV N_BT  Si IV N_SR
         km s^-1            km s^-1      km s^-1      km s^-1          km s^-1

-        0.0       0.000    -            -            -3000            -9108            -           -           -
-        0.0       0.000    -            -            -3000            -9108            -           -           -
-        0.0       0.000    -            -            -                -                -           -           -
-        0.0       0.000    -            -            -3000            -25000           -           -           -
-        0.0       0.000    -            -            -3000            -9108            -           -           -

Table 1 — continued 

Si IV   Si IV   Si IV   Si IV   Si IV   C IV      C IV    C IV      C IV      C IV
N_ST    Inc     CBP     CPF     BBP     BI        d       v_min     v_max     v_cov,min
                                        km s^-1           km s^-1   km s^-1   km s^-1

        1                               0.0       0.000                       -3000
        1                               0.0       0.000                       -3000
        1                               0.0       0.000
                                        0.0       0.000                       -3000
        1                               0.0       0.000                       -3000

Table 1 — continued 

C IV        C IV    C IV    C IV    C IV    C IV    C IV    C IV    C IV    Al III
v_cov,max   N_BR    N_BT    N_SR    N_ST    Inc     CBP     CPF     BBP     BI
km s^-1                                                                     km s^-1

-25000                                                                      0.0
-25000                                                                      0.0
                                            1                               0.0
-25000                                                                      0.0
-25000                                                                      0.0

Table 1 — continued 

Al III   Al III    Al III    Al III      Al III      Al III   Al III   Al III   Al III
d        v_min     v_max     v_cov,min   v_cov,max   N_BR     N_BT     N_SR     N_ST
         km s^-1   km s^-1   km s^-1     km s^-1

0.000                        -3000       -25000
0.000                        -3000       -25000
0.000                        -3000       -8126
0.000                        -3000       -25000                        9
0.000                        -3000       -25000

Table 1 — continued 

Al III   Al III   Al III   Al III   Mg II     Mg II   Mg II     Mg II     Mg II       Mg II
Inc      CBP      CPF      BBP      BI        d       v_min     v_max     v_cov,min   v_cov,max
                                    km s^-1           km s^-1   km s^-1   km s^-1     km s^-1

                                    0.0       0.000                       -3000       -25000
                                    0.0       0.000                       -3000       -25000
1                                   0.0       0.000                       -3000       -25000
                                    0.0       0.000                       -3000       -25000
                                    0.0       0.000                       -3000       -25000

Table 1 — continued 

Mg II   Mg II   Mg II   Mg II   Mg II   Mg II   Mg II   Mg II
N_BR    N_BT    N_SR    N_ST    Inc     CBP     CPF     BBP